May 29, 2018

Experiments With Code In Production

Written by developer-textnow

Tips & Tricks

On the TextNow iOS team, we regularly break larger deliverables down into smaller parts and deploy them every sprint. Feature toggles let us keep the production code stable and turn a feature on only once the entire feature works as expected.

Feature toggles allow you to write code and deploy it to production without having it execute there. This mechanism of shipping code that does not execute allows iterative feature development. Features can go out in smaller chunks every sprint, and you do not have to keep long-lived feature branches around, waiting for them to mature before merging them into a release branch.

Here is an example of a feature toggle where the code takes one of 2 paths based on the toggle value.

function performTask() {
   if (featureToggle["feature203"] == ON) {
      // new code path
   } else {
      // old code path
   }
}

This is a simple example, used purely for illustration. None of our feature toggles are actually this simple.
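
For readers who want something they can run, here is a minimal sketch of the same idea in Python. It assumes the toggles live in a plain dictionary; `FEATURE_TOGGLES` and `perform_task` are hypothetical names chosen for illustration, and a real app would more likely read toggle values from remote configuration.

```python
# Hypothetical in-memory toggle store (an assumption for this sketch);
# a real app would typically load these values from remote configuration.
FEATURE_TOGGLES = {"feature203": False}


def perform_task(toggles=FEATURE_TOGGLES):
    """Run the new code path only when the toggle is ON."""
    # A missing toggle is treated as OFF, so unknown features stay dark.
    if toggles.get("feature203", False):
        return "new code path"
    return "old code path"
```

Defaulting a missing toggle to OFF is a deliberate choice in this sketch: code that ships before its toggle is configured stays on the old path.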

Note that the feature toggle above is either ON or OFF, so the new code path is either executed or not executed. The next decision one has to make is when the feature can be turned ON.

  • Is it completely safe to use the new path? (That is, do we have enough runs on the new code path to ensure we don’t crash the application or have other unexpected events?)

  • If there are new UX designs involved in the new code path, then should we allow all users to see the new UX as soon as the feature is toggled ON?

  • Are we collecting enough data along the way to know whether the feature should be turned ON?

There are many more decision points that you can keep deliberating on without being able to make a firm YES or NO decision.

From black and white to shades of gray

Instead of thinking of this as a YES / NO decision, what about putting percentages on it, for example 10% YES and 90% NO? This 10–90 split can reflect the team's confidence in the feature. That confidence score can then be translated into a corresponding 10–90 split of the audience: 10% of the user base will experience the new code path, while the remaining 90% will be unaware of it.

Now the decision-making for a feature has become much easier. By negotiating a better split between the ON and OFF paths, you can push the code out faster and start measuring based on actual sessions logged in each path, rather than just going with your gut feeling. This is called “experimenting”, or A/B testing.

Here’s what the new code looks like. It is very similar to the old one, but here the current session also factors into the toggle decision.

function performTask() {
   // This check passes for only a fraction of the sessions, say 10%.
   if (isFeature203EnabledForThisSession()) {
      // new code path
   } else {
      // old code path
   }
}
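
How a check like `isFeature203EnabledForThisSession()` makes its decision is not shown here. One common approach, sketched in Python under our own assumptions, is to hash a session identifier into one of 100 buckets and enable the feature for the buckets below the rollout percentage. The function names, the session ID format, and the choice of SHA-256 are all hypothetical and not a description of TextNow's tooling.

```python
import hashlib


def rollout_bucket(session_id: str, feature: str) -> int:
    """Map a session to a stable bucket in the range 0..99."""
    # Including the feature name keeps bucket assignments independent
    # from one feature to the next.
    key = f"{feature}:{session_id}".encode("utf-8")
    return int(hashlib.sha256(key).hexdigest(), 16) % 100


def is_feature_enabled_for_session(session_id: str, feature: str, percent_on: int) -> bool:
    """Return True for roughly `percent_on` percent of sessions."""
    return rollout_bucket(session_id, feature) < percent_on
```

Because the bucket depends only on the inputs, the same session always gets the same answer, and raising `percent_on` from 10 to 20 keeps the original 10% enabled while adding new sessions.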

Experimenting in code

At TextNow we use sophisticated tools to do A/B testing. We run experiments on our user base (2MM DAU as of this month and growing), collect metrics, and then keep turning up the lever until all our users experience the new code paths.

Here’s how we start with experimentation. First, we define certain variables in the code; typically, each one is a boolean. We then write code around this decision point and define 2 code paths. (NOTE: A variable can also take many more values, e.g. 0–10. I am just choosing a simple one to explain how we experiment.)

Once the code paths are defined, we call the old path for the control group and the new path for the experimental group. The new path is also called the variant, since it varies from the norm. One can have many variants, one for each value the variable takes on.

The values for the variables are assigned based on the control vs. experimental group distribution. In this example, out of a population of 2MM users, 90% of the users will get the old code path value, and the remaining 10% will get the new code path value.
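
With 2MM users, a 90/10 split works out to about 1.8 million users in the control group and about 200,000 in the experimental group. As a rough illustration of how values might be handed out, the Python sketch below assigns each user a variant and the matching variable value from a list of audience shares. The `VARIANTS` structure, the `assign_variant` name, and the hashing scheme are assumptions made for this example.

```python
import hashlib

# Hypothetical experiment definition: (variant name, variable value, share).
# Shares are percentages of the audience and are expected to sum to 100.
VARIANTS = [
    ("control", False, 90),
    ("experimental", True, 10),
]


def assign_variant(user_id: str, experiment: str, variants=VARIANTS):
    """Return (variant name, variable value) for a user, stable across calls."""
    key = f"{experiment}:{user_id}".encode("utf-8")
    bucket = int(hashlib.sha256(key).hexdigest(), 16) % 100
    upper = 0
    for name, value, share in variants:
        # Each variant owns the bucket range [previous upper, upper).
        upper += share
        if bucket < upper:
            return name, value
    raise ValueError("variant shares must sum to 100")
```

The same function covers a variable with more than 2 values: add one entry per variant, for example three entries with shares of 50, 30, and 20.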

Audience for an experiment

The audience for an experiment is usually based on the app version in which the new code path was added. (Many other, more sophisticated filtering mechanisms can also be applied to define an audience.) Once this audience is selected, the control and experimental group percentages are defined, and each group is given a value for the experimental variable.
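
A version-based audience filter could look something like the following Python sketch. It assumes purely numeric dotted version strings, and the version numbers used with it are made up; `parse_version` and `in_audience` are hypothetical helpers for illustration only.

```python
def parse_version(version: str, width: int = 3) -> tuple:
    """Turn a dotted version such as '18.22.1' into a comparable tuple of ints."""
    parts = [int(part) for part in version.split(".")]
    # Pad short versions with zeros so '18.22' compares equal to '18.22.0'.
    parts += [0] * (width - len(parts))
    return tuple(parts)


def in_audience(app_version: str, min_version: str) -> bool:
    """A user is eligible only if their app version contains the new code path."""
    return parse_version(app_version) >= parse_version(min_version)
```

Comparing tuples of integers rather than raw strings matters here: as strings, "18.9.0" sorts after "18.22.0", which would wrongly admit an older build.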

Measuring the experiment

Once the experiment is in play, we start measuring it. There are direct and indirect measures for an experiment. Direct measures can be things like crashes generated in the 2 code paths, events triggered by user actions, and events triggered by unexpected conditions. Indirect measures can be retention (7-day, 14-day, etc.) and user engagement (session length, unique sessions per user, time in app, etc.).
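
As a small example of a direct measure, the Python sketch below compares crash rates between the 2 code paths. It assumes each logged session has already been reduced to a `(variant, crashed)` pair; that record shape and the `crash_rate_by_variant` name are assumptions for this illustration.

```python
from collections import Counter


def crash_rate_by_variant(sessions):
    """Take (variant, crashed) pairs and return {variant: crash rate}."""
    totals = Counter()
    crashes = Counter()
    for variant, crashed in sessions:
        totals[variant] += 1
        if crashed:
            crashes[variant] += 1
    # Only variants with at least one session appear, so no division by zero.
    return {variant: crashes[variant] / totals[variant] for variant in totals}
```

Rates, not raw counts, are what make the groups comparable, since a 90/10 split means the control group logs far more sessions than the experimental group.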

After the experiment has run for 2 weeks, we check the numbers. If they are positive, we turn up the lever and start deploying more aggressively. If the numbers are not good, we turn down the percentages, or even turn the experiment off completely, and get back to work on fixing issues.

Ending the experiment

Every 2 weeks we decide whether to increase or decrease the audience for the experiment. When it reaches either 100% or 0%, it is time to dismantle the experiment from the code. This cleanup is necessary to ensure we do not let experimental code persist past its expiry date.
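
The ramp-up and ramp-down cycle can be sketched as a tiny state update in Python. The fixed step of 10 percentage points and the function names are assumptions for illustration; in practice the size of each change is a judgment call based on the numbers.

```python
def next_rollout_percent(current: int, numbers_look_good: bool, step: int = 10) -> int:
    """Move the rollout percentage up or down by `step`, clamped to 0..100."""
    change = step if numbers_look_good else -step
    return max(0, min(100, current + change))


def experiment_finished(percent: int) -> bool:
    """At 0% or 100% the experiment code is ready to be removed."""
    return percent in (0, 100)
```

For example, starting at 10% with good numbers moves the rollout to 20%, while bad numbers at 10% take it to 0%, at which point the experiment is due for cleanup.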

And if you want to go more in-depth, I recommend checking out Martin Fowler’s blog for a deep dive on Feature Toggles.

At TextNow we are solving interesting problems like this every day. If you like a good challenge, check out some of our Engineering job openings.