If there were an oath for A/B/n testing professionals, the first line would likely be something akin to "research, ideate, experiment, act." The second line, perhaps surprisingly, would be borrowed from the Hippocratic Oath: "do no harm."
As an experimentation professional, you work hard to find a pain point in your business's customer journey, engineer the best potential solution, and roll it out to your users. You use data to inform your change, however big it may be. But what about when there is a branding change or site redesign? Your carefully crafted testing pipeline is out the window, and changes are happening no matter what. It may seem like there is nothing you can do, since you can't feasibly test every element, but it's still important to understand how these changes affect UX and KPIs. Here's where the second line of the oath comes into play: you can assess and mitigate risk through a framework we call "Do No Harm" (DNH). Performing this kind of test lets you monitor whether these changes are adversely affecting your business.
The most obvious use case for a DNH test is in the event of a redesign, especially if you already believe the UX is optimized. You will want to monitor whether the sitewide changes are negatively impacting your business KPIs.
DNH testing can also be specifically helpful when changing forms on your site. Any type of form (e.g., lead form, email newsletter sign-up, free product tour) is sensitive to change. Adding fields may cause friction and result in fewer users completing the form. Conversely, removing fields that introduce friction can lead to a statistically significant increase in conversion rate. Either way, you need to test to be confident in your form change.
A fundamental aspect of testing is creating a null hypothesis (H0) and an alternative hypothesis (H1). If we were conducting a test for changing the color of a sign-in button, we would state the following:

H0: Changing the sign-in button color has no effect on the sign-in conversion rate.

H1: Changing the sign-in button color changes the sign-in conversion rate.
Typically, the H0 states that no lift or change will occur. When performing a DNH test, it might at first blush seem intuitive to keep that convention: assign the lack of change (the risk being mitigated) to the null hypothesis and reason that if there truly is no change, we fail to reject the null hypothesis and have thereby demonstrated the change did no harm. Popular statistics websites even state that the null hypothesis should say there is no relationship between the two variables and that the alternative should say the relationship exists. But for a DNH test, this line of reasoning is statistically incorrect. In a DNH experimentation framework, your null hypothesis should state that the change will have an effect; your alternative hypothesis covers the lack of impact.
Let me explain. The goal of a hypothesis test is to reject the null hypothesis. More often than not, this means rejecting the claim that there won't be a change. After all, your team has spent time collecting data on the UX and crafting a variant it expects to have a positive impact, so of course you want to set up the test in a way that can reject the null hypothesis and let you roll out your great new feature.
In sum, the two setups are swapped: a standard test puts "no change" in the null hypothesis and "change" in the alternative, while a DNH test puts "the change harms the KPI" in the null hypothesis and "the change does no harm" in the alternative.
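One common way to make this swap concrete is a non-inferiority-style setup. You pick a margin, δ, representing the largest drop in your KPI (say, conversion rate) that you are willing to tolerate; the margin is something you choose up front, not something the DNH framing hands you. Writing p_c for the control conversion rate and p_v for the variant's, the two setups might look like this:

```latex
% Standard A/B test: the null says nothing changed
H_0 : p_v = p_c \qquad H_1 : p_v \neq p_c

% DNH (non-inferiority style) test: the null says the change causes harm
% of at least the tolerated margin \delta
H_0 : p_v \le p_c - \delta \qquad H_1 : p_v > p_c - \delta
```

Rejecting the DNH null is then a direct statement that any harm is smaller than the margin you chose.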
Another reason to set up your null hypothesis as the change in your KPI, and the alternative as the DNH statement, is the definition of a p-value. In statistics, a p-value is defined as "the probability, assuming that H0 is true, of obtaining a random sample of size n that results in a test statistic at least as extreme as the one observed in the current sample" [emphasis is my own] (Camm et al., 2019). A p-value is predicated on the assumption that the null hypothesis is true; therefore, it is critical to set up your test in a way that allows for the rejection (or failure thereof) of the null hypothesis.
You may be wondering: why aim to reject a null hypothesis rather than simply fail to reject it? Statistically, rejection is a stronger statement than failing to reject. Wording matters here too: if the sign-in button color test from above fails to reject the null hypothesis, we haven't "accepted" or "proved" that the null hypothesis is true. Failing to reject the null hypothesis doesn't mean the change doesn't exist; it means your test failed to collect sufficient evidence that it does exist. We want to outright reject the claim that a harmful change exists in order to show the experiment has done no harm.
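To see how the p-value and the reject/fail-to-reject decision play out, here is a minimal Python sketch of the DNH setup above, run as a one-sided two-proportion z-test. The traffic counts, margin, and significance level are hypothetical, and this is one reasonable way to implement the check rather than the only one:

```python
# Minimal sketch of a DNH (non-inferiority style) check on conversion rate.
# All counts, the margin, and alpha are illustrative assumptions.
from math import sqrt
from scipy.stats import norm

control_conversions, control_n = 4_150, 50_000   # hypothetical control arm
variant_conversions, variant_n = 4_080, 50_000   # hypothetical redesign arm

p_control = control_conversions / control_n
p_variant = variant_conversions / variant_n

delta = 0.005   # largest drop in conversion rate we are willing to tolerate
alpha = 0.05    # significance level

# H0: p_variant <= p_control - delta   (the redesign does harm of at least delta)
# H1: p_variant >  p_control - delta   (the redesign does no meaningful harm)
se = sqrt(p_control * (1 - p_control) / control_n
          + p_variant * (1 - p_variant) / variant_n)
z = (p_variant - p_control + delta) / se
p_value = 1 - norm.cdf(z)   # one-sided p-value, computed assuming H0 is true

if p_value < alpha:
    print(f"p = {p_value:.4f}: reject H0 -- evidence of no harm beyond a {delta:.1%} drop.")
else:
    print(f"p = {p_value:.4f}: fail to reject H0 -- cannot rule out a {delta:.1%} (or larger) drop.")
```

Note the asymmetry described above: a small p-value here is evidence of no harm, while a large one only means you couldn't rule harm out.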
Understanding whether the changes you have made do harm, and whether that risk has been mitigated, is an important question for businesses to answer. Instead of just rolling out an unavoidable change and hoping for the best, conduct a DNH test and monitor the results to make sure your KPIs are not negatively impacted.
Need help running a DNH test or another experiment? Concord's team of A/B Testing and Experimentation experts can assist with your business's testing needs. Reach out today!
Camm, J., Cochran, J., Fry, M., Ohlmann, J., Anderson, D., Sweeney, D., & Williams, T. (2019). Business Analytics (3rd ed., Chapter 6: Statistical Inference). Cengage South-Western.