A/B Testing
Let’s consider a test goal of designing web pages using marketing strategy, to get maximum online audience & placement for marketing ad creative. (i.e., groups of words used in text promotion.)
A/B testing or bucket testing or split-run testing is utilized to test “likeness performance of a text/ ad-message” by control group and test audience, for two finalized designs of promotional content; for the vector under consideration.
At ideation stage, copywriter/marketer may narrow down two final designs to be creatively scaped on landing page of website, email template, webpage advertisement or ad-word ad, pop-up message on end user screen. A set of reliable measurable data on Key Performance Indicators KPIs, “Return on Investment”, “Click Through Rate”, “Average Revenue Per Session”, “Revenue Per Visitor”, is put to scanner through “A/B testing”; to select better design for online promotion: as perceived by Control & Test audience.
ROI: Return-On-Investment, is defined as “ratio of Profit earned by investment divided by total cost of investment.”
CTR: Click-Through-Rate, is defined as ratio of “select number of users clicking on hyperlink against total number of users reaching webpage, email, ad-creative.”
ARPS: Average Revenue Per Session, is defined as, “average amount of earning per web session or earning registered in visits to the website”, and calculated by “dividing the total revenue of in web session by the total number of website visits.”
RPV: Revenue Per Visitor, is defined as, “the amount of money generated each time a customer visits a website”, and calculated by “dividing the total revenue by the total number of visitors to a website”.
The two variants— “Variant A” or “Variant B”, are put to test using one or more method-- “AB Lift Test”, “Z-Score test” & “statistical hypothesis testing” or "two-sample hypothesis testing". The scores are analysed to decide about effective outlook to be shared in production.
One may complete A/B testing by following these steps:
1. Think and decide on objective to be achieved from online promotion, including learning, training-needs to-be-met, & promotion to be shared.
2. Determine Key Performance Indicator of customer success to be utilized Return on Investment, Click Through Rate, Average Revenue Per Session, Revenue Per Visitor.
3. Decide on ad-timeslot and web-locations with highest traffic volume and that most-cost to improve optimization process. It is important to note that spots with lower traffic volume require longer campaign run periods to get standard outcomes.
4. Decide on elements of campaign design to be tested in A/B testing. Usually, a change in single element is recorded in A/B testing to get authentic results. Changes in banner- header or footer, different placement of message with varied selection of words- short precise or simple statements in user comprehendible language, may be put to test.
5. Monitor data generated in the test run and highlight trends resulting in better KPIs higher traffic, higher conversions, more revenues and better ROI%.
Precautions in A/B Testing
1. Sample Size: Sample size used in A/B testing should be sufficiently large, to generate authentic user opinions, on changes in content design; as means to effective communication. A sample size between 100 to 1000 user observation be considered a good fit. Further confidence level, statistical significance may be observed to interpret finding from statistical tests used in A/B testing.
2. Blocking: Blocking is act of accounting biases in sample group to maintain randomness in the sample. The first step is to form homogenous columns (basis gender, age, geographic location, occupation) in groups, followed by a randomized test within each group to check trends within the group. Blocking adjusts randomness biases originating from homogenous similar groups of respondents, in sample- control group (responding to campaign design without altered design variable) and test group (responding to campaign design with altered design variable), with fresh randomized testing. Blocking is effective for 2-3 variable and re-randomization is preferred in A/B Testing, when dependent covariate is higher in number in sample.
3. Adjust Test run duration with traffic volume, to generate authentic trends. A/B test run with lower traffic need to be continued for a longer duration of time to result in authentic customer preferences insights on campaign design.
4. Testing for multiple changes in campaign design, using A/B testing, may lead to spurious correlation in results.
Tests used in A/B Testing
A/B Lift: Lift in A/B Testing is defined as, “the percentage difference in conversion rate of control design and a successful test treatment.”
Application:
1. Incremental Lift in Revenue Per Session in A/B Testing: may be calculated by initially subtracting revenue per session of the control design from the test treatment campaign. The resultant figure may be divided by the revenue per session of test treatment and then multiplied by 100, to get percentage estimate of Incremental Lift in Revenue Per Session in A/B Testing.
2. Cost of Running the Test: Since, Control and Test group form basis to A/B testing, and are implemented on 50%-50% capacity: it can be inferred that control, did not generate revenue in the A/B testing. Cost of Running the test, is calculated by, “multiplying the revenue generated for control treatment by the incremental revenue per session (%).”
3. Identification of Multiplier for Test Distribution: A multiplier is required to be identified, to estimate of value required to get value of test distribution (Test/Control) equal to 100% traffic volume. Basis Test and Control group used in 50%-50% campaign runs during A/B Testing, multiplier of 2.0 is considered for evaluation purpose.
4. Estimate the Value gained if Test Campaign is rolled to a 100% traffic-users: The estimate of value gained on idealizing winning test campaign to 100% traffic user is calculated by, “multiplying cost of test campaign with value of Multiplier realised for Test Distribution in earlier step.”
5. Value Gained from Testing Period: To calculate the value gained by the winning test only during the testing period, “subtract the cost of running the test from the value gained if rolled out to 100% traffic volume.
6. Forecasting Incremental Revenue Gain: It can be safely concluded from above steps that treatment design wins and is rolled out at 100%. Further the winning treatment campaign would be earning an incremental revenue for a set period of time known as forecasted period for successful run decided using market condition, nature of product or service and campaign design. Forecasting Incremental Revenue Gain can be done by, “dividing the value of the selected winning test design at 100% traffic volume, by the number of days of test design run. Further, multiply the daily gain by the forecasted number of days.”
7. Overall Program Value: Overall Program Value need to be calculated by estimating the total value of A/B testing program including cost of all test design campaigns and statistical testing including hypothesis testing implemented for programme. The operational cost is then reduced from program cost achieved above to reach the estimate cost of A/B Testing.
Z-Score in A/B Testing:
Z-Score is standardized score used for data under normal distribution, that describes the number of standard deviations a sample element is from its mean. In case of A/B testing website or online respondent is considered as observation parameter. The Control group are represented through bell-curve in plot and Test group & sample set of all website visitors represent the second bell curve or the shift in the bell curve from sample mean.
Hypothesis Testing
Two alternative hypotheses are tested for statistical significance and impact of A/B Testing, assuming:
i. Null Hypothesis: H0: Default campaign design generates a conversion rate equal to x% (default).
ii. Alternate Hypothesis: H1: New variable in campaign design generates a conversion rate greater than x%.
Here, critical observation is expressed using
p-value= statistical significance of test=probability of negating Null hypothesis when Null Hypothesis is true.
Alpha(α)= false positive rate; or significance level; or type I error.
Beta(β)= false negative rate; or type II error.
power: true positive rate; (1 – beta). 1 - alpha: true negative rate.0
Likely outcome of Hypothesis testing:
No error
Type I error: In-correctly rejecting Null Hypothesis, when H0 is true and accepting alternative hypothesis as likely outcome of Hypothesis testing.
Type II error: In-correctly accepting Null Hypothesis, when H1 is true and rejecting alternative hypothesis as likely outcome of Hypothesis testing.
Here, key consideration is to determine, whether the difference between two sample distribution; i.e., the original default control design & the new test design with change in one variable; has occurred as random chance: or, there is a presence of bias, accounting for the difference. Statistical reasoning using three critical parameters Confidence, Statistical significance and Significance may be utilized to remove speculation about bias in observation of control and test design.
Confidence intervals measure the degree of uncertainty or certainty in a sampling method. Or a confidence level refers to the percentage of all possible samples that can be expected to include the true population parameter. Or, a confidence interval is how much uncertainty there is with any particular statistic. Confidence intervals are often used with a margin of error. Normally confidence level of 95% to 99% is desirable for test statistics to reach meaningful interpretation.
Statistical significance is validation of a result to be attributable to a specific cause, after testing or experimentation, is achieved on sample data. In statistical hypothesis testing, a result has statistical significance when result is an unlikely outcome, given the null hypothesis. Statistical Significance is measured in p-value and normally p-value for valid statistical significance is estimate as p-value=0.05. The result hold statistical significance, when p ≤ α.
The significance for a hypothesis test is a value, for which a P-value, less than or equal to, is considered statistically significant. The significance value may be shared as 0.1, 0.05, 0.01 depending on nature of sample universe and study type. The significance values correspond to the probability of observing an extreme value in sample by chance.