A/B Testing with a Hypothesis Testing Framework

Introduction

A/B Testing & Hypothesis Testing:

A/B testing and hypothesis testing are two of the most useful tools for data-driven decision making. A/B testing compares different versions of a webpage, app interface, or marketing campaign so that organizations can optimize their digital presence. Hypothesis testing adds statistical rigor: it tells us whether an observed difference between versions is likely to be real or just noise. In this article, we explore how the two complement one another and walk through a complete example.

Z-score

There are several statistical tests for comparing groups, but today we will use a two-tailed z-test. In hypothesis testing, a two-tailed z-test evaluates whether two population means differ significantly, without assuming a direction for the difference. It tells us whether the observed data provide sufficient evidence to reject the null hypothesis (no difference) in favor of the alternative hypothesis (some difference). In A/B testing, the same test compares the performance of two variants. By calculating the z-score and comparing it to a critical value (or, equivalently, comparing its p-value to a significance level), we can decide whether one variant outperforms the other or whether there is no significant difference.
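As a compact reference, the two-sample z-score can be written as follows. The symbols are shorthand introduced here: M, SD, and N are each variant's mean, standard deviation, and sample size, and the subscripts 1 and 2 label the two variants. The 1.96 cutoff is the standard two-tailed critical value for a 0.05 significance level.

```latex
z = \frac{M_1 - M_2}{\sqrt{\dfrac{SD_1^2}{N_1} + \dfrac{SD_2^2}{N_2}}},
\qquad
\text{reject } H_0 \text{ at } \alpha = 0.05 \text{ if } |z| > 1.96
```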

Let’s Get Started!

Data Background

For today’s example we will use sample data from a mobile game called “Cookie Cats”, obtained from Kaggle and provided by user Mürşide Yarkın.

From Mürşide’s post on Kaggle:

Cookie Cats is a hugely popular mobile puzzle game developed by Tactile Entertainment. It’s a classic “connect three”-style puzzle game where the player must connect tiles of the same color to clear the board and win the level.

As players progress through the levels of the game, they will occasionally encounter gates that force them to wait a non-trivial amount of time or make an in-app purchase to progress. In addition to driving in-app purchases, these gates serve the important purpose of giving players an enforced break from playing the game, hopefully resulting in the player’s enjoyment of the game being increased and prolonged.

But where should the gates be placed? Initially the first gate was placed at level 30, but in this notebook we’re going to analyze an AB-test where we moved the first gate in Cookie Cats from level 30 to level 40. In particular, we will look at the impact on player retention. But before we get to that, a key step before undertaking any analysis is understanding the data.

Step 1: Load Necessary Data + Additional Prep

For this example, we will be utilizing Sigma’s upload csv feature to import two data elements:

Note: for this you will need to enable write access.

Our independent variable will be column “version” – whether users were placed in the condition of the first gate at level 30 (version = gate_30) or at level 40 (version = gate_40).

Our dependent variables will be:

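  • retention_1: whether the player was still playing 1 day after installing (averaged per group, this gives the percentage of players retained)
  • retention_7: whether the player was still playing 1 week after installing (averaged per group, this gives the percentage of players retained)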
  • sum_gamerounds: the total number of rounds played within the first 2 weeks

To easily switch among these dependent variables, we will create a control element.

  1. Create New List Control named “Dependent Variable”.
  2. Check: Required
  3. Uncheck: Allow Multiple selection & Show Null Option
  4. Add Values:
    – “Game Rounds”
    – “1 Day Retention”
    – “1 Week Retention”

  5. Add a new column titled “DV”: If([Dependent-Variable] = "Game Rounds", [sum_gamerounds], [Dependent-Variable] = "1 Day Retention", CountIf([retention_1]), [Dependent-Variable] = "1 Week Retention", CountIf([retention_7]))
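For readers who want to reproduce the idea outside Sigma, here is a minimal Python sketch of the same switch. The rows are made-up examples that only mirror the column names, and `dependent_value` is a hypothetical helper. It assumes the retention columns are booleans and encodes them as 0/1, so that a group average reads as a retention rate.

```python
# Hypothetical rows mirroring the Cookie Cats columns; the values are made up.
rows = [
    {"version": "gate_30", "sum_gamerounds": 3, "retention_1": False, "retention_7": False},
    {"version": "gate_40", "sum_gamerounds": 38, "retention_1": True, "retention_7": False},
    {"version": "gate_40", "sum_gamerounds": 165, "retention_1": True, "retention_7": True},
]


def dependent_value(row, choice):
    """Return the numeric DV for one row, given the control selection."""
    if choice == "Game Rounds":
        return row["sum_gamerounds"]
    if choice == "1 Day Retention":
        return int(row["retention_1"])  # True -> 1, False -> 0
    if choice == "1 Week Retention":
        return int(row["retention_7"])
    raise ValueError(f"unknown dependent variable: {choice}")


# Build the DV column for the currently selected dependent variable.
dv = [dependent_value(r, "1 Day Retention") for r in rows]
```

Switching the second argument plays the role of changing the “Dependent Variable” control.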

Step 2: Calculate Skewness

Statistical skewness is a measure that describes the asymmetry or departure from symmetry in a probability distribution. It indicates the extent to which the values of a dataset or a probability distribution are concentrated towards one tail compared to the other.

  • Positive skewness: If a distribution has a positive skewness, it means that the tail on the right side of the distribution is longer or fatter than the left tail. The majority of the data points are concentrated on the left side of the distribution, while the right side has a few extreme values that drag the mean towards the right.
  • Negative skewness: If a distribution has a negative skewness, it means that the tail on the left side of the distribution is longer or fatter than the right tail. The majority of the data points are concentrated on the right side of the distribution, while the left side has a few extreme values that pull the mean towards the left.
  • Zero skewness: A perfectly symmetrical distribution has zero skewness. The left and right tails are of equal length, and for a unimodal distribution the mean, median, and mode are all at the same point.


  1. Group by Version
  2. Inside Grouping calculate N for both groups | New Column in Grouping “N” → Count()
  3. Inside Grouping calculate the Mean for both groups | New Column in Grouping “M” → Avg([DV])
  4. Inside Grouping calculate the SD for both groups | New Column in Grouping “SD” → Stddev([DV])
  5. Outside of grouping create a new column “Deviation” | Power(([DV] - [M]), 3)
  6. Inside Grouping calculate a new column “Skew” | Sum([Deviation]) / (([N] - 1) * Power([SD], 3))
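The steps above can be sketched in Python as a cross-check. This is an illustration only: the two groups hold made-up numbers, `skewness` is a hypothetical helper, and it assumes Sigma's Stddev returns the sample standard deviation (divisor N - 1), which is what `statistics.stdev` computes.

```python
from statistics import mean, stdev


def skewness(values):
    """Sum of cubed deviations divided by (N - 1) * SD^3, as in the steps above."""
    n = len(values)
    m = mean(values)
    sd = stdev(values)  # sample SD; assumed to match Sigma's Stddev
    cubed_deviations = sum((x - m) ** 3 for x in values)
    return cubed_deviations / ((n - 1) * sd ** 3)


# Made-up DV values per version, for illustration only.
groups = {
    "gate_30": [1, 2, 3, 4, 5],   # symmetric, so skewness is zero
    "gate_40": [1, 1, 2, 3, 13],  # one large value, so skewness is positive
}
skew_by_version = {version: skewness(values) for version, values in groups.items()}
```

A long right tail, such as a handful of players with very high round counts, shows up as a positive value.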

Step 3: Calculate Kurtosis

Kurtosis is a statistical measure that describes the shape of a probability distribution by quantifying the tailedness and peakedness of the distribution compared to a normal distribution. It provides information about the presence of outliers or extreme values in a dataset.

  • Excess kurtosis: Excess kurtosis refers to the kurtosis of a distribution minus 3. This measure is commonly used and allows for comparisons with a normal distribution. It can take positive or negative values.

    • Positive excess kurtosis: If a distribution has positive excess kurtosis, it means that the distribution has heavier tails and a sharper, more peaked central region compared to a normal distribution. This indicates the presence of outliers or extreme values in the dataset, resulting in a higher concentration of data points in the tails.
    • Negative excess kurtosis: If a distribution has negative excess kurtosis, it means that the distribution has lighter tails and a flatter central region compared to a normal distribution. This suggests that the dataset has fewer outliers and is less peaked.
  • Kurtosis independent of normal distribution: Kurtosis can also be considered without reference to a normal distribution. In this case, it measures the overall shape of the distribution without indicating whether it deviates from a normal distribution or not.

    • High kurtosis: A high kurtosis value indicates a distribution with heavy tails and a sharp peak. This indicates a higher likelihood of outliers and extreme values.
    • Low kurtosis: A low kurtosis value indicates a distribution with lighter tails and a flatter peak. This suggests a lower probability of outliers and extreme values.

  1. Outside of grouping create a new column “2nd Moment Prestep” | Power([DV] - [M], 2)
  2. Outside of grouping create a new column “4th Moment Prestep” | Power([DV] - [M], 4)
  3. Inside of grouping create a new column “2nd Moment” | Sum([2nd Moment Prestep]) / [N]
  4. Inside of grouping create a new column “4th Moment” | Sum([4th Moment Prestep]) / [N]
  5. Inside of grouping create a new column “Kurtosis” | [4th Moment] / Power([2nd Moment], 2)
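The same moment-based calculation can be sketched in Python. The `kurtosis` helper and the sample values are hypothetical; the function follows the steps above, dividing the fourth central moment by the square of the second.

```python
def kurtosis(values):
    """Fourth central moment divided by the squared second central moment."""
    n = len(values)
    m = sum(values) / n
    second_moment = sum((x - m) ** 2 for x in values) / n
    fourth_moment = sum((x - m) ** 4 for x in values) / n
    return fourth_moment / second_moment ** 2


# Made-up values, for illustration only.
sample = [1, 2, 3, 4, 5]
raw_kurtosis = kurtosis(sample)
excess_kurtosis = raw_kurtosis - 3  # compare against a normal distribution
```

Note that this ratio is the plain (not excess) kurtosis, for which a normal distribution scores 3; subtracting 3 gives the excess kurtosis described earlier.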

Step 4: Calculate Z-Score

Finally, we conduct a z-test, which tells us whether the two groups differ significantly.

  1. In a table summary, create a new value called “Two Tailed Z Score” using the following formula: (First([M]) - Last([M])) / Sqrt(((Power(First([SD]), 2) / First([N]) + (Power(Last([SD]), 2) / Last([N])))))
  2. In another table summary, create a new value called “P-Value (Z Score Table)” using the following formula: Lookup([Z Score Table/P-Value], Round([Two Tailed Z Score], 2), [Z Score Table/Z-Score])
  3. In another table summary, create a new value called “Significance” using the following formula: If([P-Value (Z Score Table)] < 0.05, "Significant", "Not Significant")
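As a cross-check outside Sigma, the three summary values can be sketched in Python. One deliberate difference: instead of looking the p-value up in a z-score table, this sketch computes it directly from the standard normal distribution with `statistics.NormalDist`. The function name and the input numbers are hypothetical.

```python
from math import sqrt
from statistics import NormalDist


def two_tailed_z_test(m1, sd1, n1, m2, sd2, n2, alpha=0.05):
    """Return (z, p_value, label) for a two-sample, two-tailed z-test."""
    z = (m1 - m2) / sqrt(sd1 ** 2 / n1 + sd2 ** 2 / n2)
    # Two-tailed p-value: probability of a result at least this extreme
    # in either direction under the standard normal distribution.
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))
    label = "Significant" if p_value < alpha else "Not Significant"
    return z, p_value, label


# Made-up group summaries (mean, SD, N per variant), for illustration only.
z, p_value, label = two_tailed_z_test(0.5, 1.0, 100, 0.0, 1.0, 100)
```

Taking the absolute value of z keeps the result the same regardless of which variant is listed first.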

Now we can explore our three dependent variables. Cycling through them, we see no statistically significant difference for “Game Rounds” or “1 Day Retention”, but there is one for “1 Week Retention”. Namely, we can conclude that the retention rate is significantly higher when the gate is at level 30 (19.0%) than when the gate is at level 40 (18.2%).
