Feature Flag Best Practices
1. Introduction
2. The Moving Parts of a Feature-Flagging System
a. Performance
b. Configuration Lag
c. Security
d. Implementation Complexity
a. Code First
b. Data First
c. Big Bang
d. Expand-Contract Migrations
e. Duplicate Writes and Dark Reads
f. Working with Databases in a Feature-Flagged World
i. Inference
ii. Causality
b. Categories of Feedback
12. Summary
Chapter 1. Introduction
At its core, feature flagging is about your software being able to choose
between two or more different execution paths, based upon a flag
configuration, often taking into account runtime context (i.e., which user is
making the current web request). A toggle router decides the execution
path based on runtime context and flag configuration.
To achieve this, we modify our checkout page rendering code so that there
are two different execution paths available at a specific toggle point:
renderCheckoutButton() {
  if (
    features
      .for({ user: currentUser })
      .isEnabled("showReallyBigCheckoutButton")
  ) {
    return renderReallyBigCheckoutButton();
  } else {
    return renderRegularCheckoutButton();
  }
}
Every time the checkout page is rendered, our software will use that if
statement (the toggle point) to select an execution path. It does this by
asking the feature-flagging system’s toggle router whether the
showReallyBigCheckoutButton feature is enabled for the current
user requesting the page (the current user is our runtime context). The
toggle router uses that flag’s configuration to decide whether to enable
that feature for each user.
Let’s assume that the configuration says to show the really big checkout
button to 10% of users. The router would first bucket the user, randomly
assigning that individual to one of 100 different buckets. The router would
then report that the feature is enabled if the current user has landed in
buckets 0 through 9, and disabled if they've landed in any of the
remaining buckets (10 through 99).
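To make this concrete, here's a minimal sketch of how a toggle router might implement percentage-based bucketing. The hash function and API shape are assumptions for illustration; real systems typically use a stable hash of the user ID and flag name so that each user always lands in the same bucket for a given flag.

// Sketch of percentage-based bucketing (illustrative, not a real library).
// Hashing the user ID together with the flag name gives each user a
// stable bucket per flag, so re-evaluating the flag yields the same answer.
function bucketFor(userId, flagName) {
  const key = flagName + ":" + userId;
  let hash = 0;
  for (let i = 0; i < key.length; i++) {
    hash = (hash * 31 + key.charCodeAt(i)) >>> 0; // simple 32-bit string hash
  }
  return hash % 100; // buckets 0 through 99
}

function isEnabled(userId, flagName, rolloutPercentage) {
  // Enabled if the user's bucket falls below the rollout percentage:
  // at 10%, buckets 0 through 9 see the feature.
  return bucketFor(userId, flagName) < rolloutPercentage;
}

Because the bucket depends only on the user and the flag, raising the rollout percentage from 10 to 60 simply raises the threshold: users in buckets 0 through 9 stay enabled, which is why existing users keep seeing the feature as it ramps up.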
You, as the person in charge of the rollout, set up this new feature to
initially be seen by 10% of the user population. Exposing the feature from
zero to a subset of the population is often called ramping up a feature.
Here, a feature was ramped to 10%.
Increasing the exposure to a broader user population should not affect the
current exposure of variations to users—if a user experienced a feature
when it ramped to 40%, that user should continue to see it as it ramps to
60%, 80%, and so on. In other words, existing allocations should remain intact.
A particular case occurs if you were to “de-ramp” (reduce exposure of) the
feature; for example, reducing exposure from 10% to 5%, as in Figure 3-2.
We know that user A was part of the “on” group in the 10% sample.
Unless your feature-flagging system has the notion of “memory” to
remember the prior allocation of user A, there is little you can do to keep
user A in the "on" group when reducing exposure, simply because we don't
know a priori whether user A will land in the "on" or "off" group.
Figure 3-2. Flag consistency during feature de-ramp
Chapter 4. Best Practice #2:
Bridge the “Anonymous” to
“Logged-In” Transition
When dealing with an anonymous user, you first need to decide whether
it’s important to maintain feature-flag consistency during the transition
from visitor ID to user ID.
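If you do decide consistency matters, one common shape for the solution is sketched below: evaluate flags against the visitor ID before login, and at login time tell the flagging system that the visitor and the user are the same person. The features.alias call is hypothetical; it stands in for whatever identity-linking mechanism your flagging system provides.

// Sketch: keeping flag allocations consistent across the
// anonymous-to-logged-in transition. `features.alias` is hypothetical.
function onLogin(visitorId, user) {
  // Link the two identities so the user inherits the bucket
  // allocations made while they were an anonymous visitor.
  features.alias({ visitorId: visitorId, userId: user.id });
}

// Before login, flags are evaluated against the visitor ID...
features.for({ visitorId: visitorId }).isEnabled("showReallyBigCheckoutButton");
// ...and after login, against the user ID, with the alias preserving
// the allocation made while the user was anonymous.
features.for({ user: currentUser }).isEnabled("showReallyBigCheckoutButton");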
Performance
By moving flagging decisions to the server side, you can improve user-perceived
performance. Single-page applications already make a server-side
call to fetch the data needed to render the UI. That same round trip can
also call the feature-flag service, so a single network request returns all
feature-flag evaluations along with the server-side data.
Configuration Lag
One way in which engineers improve performance of an application is to
cache data locally, thereby reducing network latency. This has an impact
on where the feature flag decision should be made. You could opt to
proactively request all flagging decisions for a specific runtime context
(e.g., current user, browser, and geolocation) from the server. Or, you
could just request the current feature-flagging configuration and make
flagging decisions using a client-side toggle router. In both approaches,
you are at risk of Configuration Lag.
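As an illustration, here's a minimal sketch of the second approach: a client-side toggle router working from a periodically refreshed, cached copy of the flag configuration. The fetchFlagConfig call is hypothetical; the point is that any cached copy can lag behind the server by up to the refresh interval.

// Sketch of a client-side toggle router using a cached configuration.
// `fetchFlagConfig` is a hypothetical call to your flagging service; a
// server-side flag change may take up to the refresh interval (here,
// 60 seconds) to be reflected in decisions made by this client.
let cachedConfig = {};

async function refreshConfig() {
  cachedConfig = await fetchFlagConfig();
}

setInterval(refreshConfig, 60 * 1000);

function isEnabled(flagName) {
  const flag = cachedConfig[flagName];
  if (!flag) return false; // unknown flags default to off
  return flag.enabled === true; // a real router would also apply rollout rules
}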
Security
Whenever you move a feature-flagging decision to the client, you’re
exposing information about the existence of those decisions—anyone
who’s able to log in to your product can potentially see what product
features are under active management and can also manipulate which
variants they experience. If you’re concerned about industrial espionage or
particularly nosy tech journalists, this might be relevant, but that’s
unlikely to be the case for the typical feature-flag practitioner.
Implementation Complexity
Most delivery teams working with feature flags need the ability to make a
server-side toggling decision. If a team also begins making toggling
decisions on the client side, it significantly increases the complexity of its
feature-flagging system. There will now be two parallel implementations,
which are likely to be implemented in multiple languages (unless you’ve
opted to implement your backend in JavaScript, in addition to your
frontend). These parallel implementations need to remain synchronized
and make consistent toggling decisions. And, as discussed earlier, if you
begin adding client-side caching into the mix, things can get quite
complicated.
Code First
We can perform the code deployment first, shown in Figure 6-2, making
sure that the new version of our code is backward-compatible with the
existing database schema.
Data First
Alternatively, we can perform our database migration first, as shown in
Figure 6-3, which means that we must ensure that the new schema is
backward-compatible with the existing code.
Figure 6-3. Data-first approach
Big Bang
In simple systems, there’s a third option (Figure 6-4): update data and
code simultaneously in a lockstep deployment in which you stop your
system, update your data to support your code change, and then restart the
system with your new code.
With the Big-Bang approach, you don’t need to worry about backward or
forward compatibility, because you’re updating both data and code in
concert. New code will never see old data, and old code will never see
new data.
Expand-Contract Migrations
When a feature-flagged code change requires a corresponding data schema
migration, this migration must be performed as a series of backward- or
forward-compatible changes, sometimes referred to as an Expand-
Contract migration, or a Parallel Change. The technique is called
Expand-Contract because the series of changes will consist of an initial
data-first change that “expands” the schema to accommodate your code
change, followed by a code-first change that “contracts” the schema to
remove aspects that are no longer needed.
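As a sketch, here's what the middle of an Expand-Contract migration might look like in application code, assuming we're migrating a single address column into structured fields. The field names and the ordersRepository helper are invented for illustration; during this phase the expanded schema holds both representations so that old and new code can coexist.

// Sketch of the duplicate-write step of an Expand-Contract migration:
// the schema has been expanded with new structured address fields, and
// the new code writes both the old and new representations so either
// version of the code reads consistent data. Names are illustrative.
function saveShippingAddress(order, address) {
  order.address = formatAsSingleLine(address); // old column, contracted later
  order.addressStreet = address.street;        // new expanded columns
  order.addressCity = address.city;
  order.addressPostcode = address.postcode;
  return ordersRepository.save(order);         // hypothetical persistence helper
}

function formatAsSingleLine(address) {
  return `${address.street}, ${address.city} ${address.postcode}`;
}

Once every reader has switched to the new columns, a final code-first change stops writing the old column, and the schema can then be contracted to drop it.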
When a user is done filling their cart and is ready to check out, they make
a request to the Web App, which in turn asks the Checkout service to
create a new checkout. The Checkout service does this and returns
information about the checkout back to the Web App, including details
about the shipping costs associated with the items in this checkout (at
Acme, shipping costs are not calculated until the user finalizes the order
and checks out).
Figure 7-2. Placing the “free shipping” flag decision in the Checkout service
But is that really the best place to make this flagging decision? Suppose
that we initially want to roll out free shipping for internal testing (an
example of the virtual UAT technique that we discussed earlier). This
means that we’d need to take into account which user is requesting the
checkout when deciding whether to allow free shipping. The user would
need to be part of the runtime context for the flagging decision. However,
our Checkout service doesn’t know anything about users—its sole
responsibility is creating and managing checkouts.
One way to solve this would be to have the Web App pass the necessary
runtime context through to the Checkout service, as demonstrated in
Figure 7-3, by telling it which user is requesting to check out. This would
allow the Checkout service to make the flagging decision locally.
Figure 7-3. Web App passes the necessary runtime context through to the Checkout
service
On the other hand, the Web App is already deeply aware of the concept of
users and already has the context of which user is requesting a checkout.
The Web App is actually in a better position to make our “free shipping”
flagging decision, as shown in Figure 7-4.
Figure 7-4. Placing the “free shipping” feature decision in the Web App
Now, every time a user proceeds to checkout, the web app will use the
feature-flagging system to decide whether this user should receive free
shipping. That decision is then passed to the Checkout service as an extra
parameter when creating the checkout. Note that the Checkout service still
has the responsibility of determining whether the order is eligible for free
shipping; that is, whether the total value of the order is more than $50. We've
avoided having to change the scope of our various components just to
support a flagging decision, and we’ve kept the underlying business logic
that’s powering the feature in the right place.
We could create a new feature flag called “free shipping banner,” and use
that to manage the display of the banner ad. But we’d need to make sure
that this banner wasn’t ever on when the other “free shipping” feature was
off; otherwise, we'd be back to grumpy customers unhappy that they
aren't getting free shipping.
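One lightweight way to enforce that dependency, sketched below, is to have the banner's toggle point consult both flags, so the banner can never show while the underlying feature is off. The flag names are carried over from the example above.

// Sketch: the banner toggle point checks both flags, so the banner
// can never appear while the "free shipping" feature itself is off.
function shouldShowFreeShippingBanner(currentUser) {
  const userFlags = features.for({ user: currentUser });
  return (
    userFlags.isEnabled("freeShipping") &&
    userFlags.isEnabled("freeShippingBanner")
  );
}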
Although this might seem like an obvious best practice, it’s sometimes not
so obvious when flags are grouped by team, and the team implementing
the shipping calculation code is very disconnected from the team
implementing the banner ad. Ideally, your product delivery teams are
already oriented around product features (e.g., the product detail page
team, the search team, the home page team) rather than technology or
projects (e.g., the frontend team, the performance optimization team). This
reduces the number of features that require cross-team collaboration, but
does not eliminate it. Some features will always require changes across
multiple teams. When that happens, it’s OK to bend the rules and have a
flag that’s being used by multiple teams. You should, however, still
always aim to have each flag owned by a clearly identified team. That
team is responsible for driving rollout of the feature, monitoring
performance, and so on.
Chapter 9. Best Practice #7:
Consider Testability
There are two approaches here. On one end, we have high-level testing,
often called end-to-end testing or black-box testing. This approach tests
the functionality from the outermost layer and ensures that the expected
behavior is working without caring how the underlying components
operate. When using feature flags, high-level testing must ensure that the
application produces the expected behavior when the feature is turned
on.
At the other end, when pursuing lower-level unit testing, you should try
to isolate the functionality the flag is gating and write tests that target
both behaviors: the functionality when the flag is on and when the
flag is off.
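As a sketch, here's how a unit test might pin the flag to each state and exercise both paths of the toggle point from Chapter 1 (treating renderCheckoutButton as a standalone function). The features.withOverride helper is hypothetical, standing in for whatever your flagging SDK provides for forcing a flag's value in tests; the test syntax is Jest-style.

// Sketch: unit tests covering both flag states. `features.withOverride`
// is a hypothetical helper that pins a flag's value for the duration of
// the callback; describe/test/expect are Jest-style.
describe("renderCheckoutButton", () => {
  test("renders the really big button when the flag is on", () => {
    features.withOverride("showReallyBigCheckoutButton", true, () => {
      expect(renderCheckoutButton()).toEqual(renderReallyBigCheckoutButton());
    });
  });

  test("renders the regular button when the flag is off", () => {
    features.withOverride("showReallyBigCheckoutButton", false, () => {
      expect(renderCheckoutButton()).toEqual(renderRegularCheckoutButton());
    });
  });
});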
Last, it’s easy to fall into the temptation of writing tests for all possible
combinations of flags. We advise reducing the scope of the test
components to span only a handful of flags or isolate the test so that the
tester can test the main target flag with the rest of the flags turned on.
Chapter 10. Best Practice #8:
Have a Plan for Working with
Flags at Scale
As engineers from different teams create, change, and roll out feature flags
across the application stack, tracking and cleanup can get out of hand over
time. Maintaining the who/what/why of feature flags and establishing a
basic process for identification and tracking will reduce complexity down
the road.
The feature name example that follows has three parts. The first part is
the name of the section the feature is gating; in this example, the feature is
gating functionality in the admin section. The second part indicates what
the feature does, named self-explanatorily: make the new invite flow
visible to users. And the last part indicates where in the stack the feature
is located: here it belongs to the backend layer of the application.
This naming pattern looks like this:
section_featurepurpose_layer
admin_panel_new_invite_flow_back_end
Here are a couple of other examples of the same feature but in different
layers of the stack:
admin_panel_new_invite_flow_front_end
admin_panel_new_invite_flow_batch
An alternative naming structure can include the team that created and
owns the flag; for instance, Data Science, Growth, Data Infrastructure.
You might also want to include the name of the service for which the flag
is used: web app, router, data writer, and so on.
One useful technique is to add a flag retirement task to the team’s work
backlog whenever a flag is created. However, these tasks can have a nasty
tendency of being perpetually deprioritized, always one or two weeks
away from being tackled.
Assigning an exact expiry date to every flag when it is created can help
break through this tendency to prioritize the urgent over the important.
Your feature-flagging system should have some way to communicate this
information, ideally with a highly visible warning if a flag has expired.
You can also opt to place a limit on the number of active flags a given
team has under management. This incentivizes the removal of old flags to
make room for a new flag.
Flagging Metadata
Attaching information like expiration dates and ownership to a flag is a
specific example of a more general capability: the ability to associate
metadata with your flags.
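For example, a flag definition carrying such metadata might look like the following sketch; the exact fields are illustrative and will depend on your flagging system.

// Sketch of flag metadata; the field names and values are illustrative.
const flagDefinition = {
  name: "admin_panel_new_invite_flow_back_end",
  owner: "growth-team",
  createdBy: "jsmith",
  createdAt: "2019-03-01",
  expiresAt: "2019-06-01", // past this date, the system flags it for cleanup
  description: "Gates the new invite flow in the admin panel (backend)",
};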
Inference
If we publish changes to our feature-flagging configuration as a separate
stream of instrumentation events, we can use the timing of these changes
as a way to correlate a feature-flag change with its effect. For example, if
we see 10% of servers having an increase in CPU utilization at the same
time as a feature flag change that rolled out to 10% of servers, we can
pretty easily infer a correlation.
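A sketch of what publishing those configuration changes might look like follows; the telemetry.emit call and the event shape are assumptions for illustration.

// Sketch: publish each flag-configuration change as an event so that
// monitoring tools can line it up against other time series.
// `telemetry.emit` and the event shape are illustrative.
function onFlagConfigChange(flagName, oldConfig, newConfig) {
  telemetry.emit("feature_flag_config_changed", {
    flag: flagName,
    previousRollout: oldConfig.rolloutPercentage,
    newRollout: newConfig.rolloutPercentage,
    timestamp: Date.now(),
  });
}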
This approach enables correlation in most simple scenarios but has some
drawbacks. In the case of a 50% feature rollout, it will be difficult to
know which effect is caused by the flag being on and which by it being
off. It's also more difficult to draw correlations when the impact of a
feature flag takes some time to appear; for example, a change that causes a
slow memory leak, or a change that affects a user's behavior in later stages
of a conversion funnel. The fundamental issue is that we're only inferring
the correlation between a feature change and its effects.
Causality
A more sophisticated approach is to include contextual information about
feature-flag state within an analytics event. This approach, most
commonly described as experimentation, ties metrics to a feature flag to
measure the specific impact of each change. For example, the analytics
event that reports whether a user clicked a button can also include
metadata about the current state of your feature flags. This allows a much
richer correlation, as you can easily segment your analytics based on flag
state to detect significant changes in behavior. Conversely, given some
change in behavior, you can look for any statistically significant
correlation to the state of flags. Experimentation establishes a feedback
loop from the end user back to development and test teams for feature
iteration and quality control.
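As a sketch, the enriched analytics event might look like this; analytics.track and features.allFlagsFor are illustrative stand-ins for your analytics pipeline and flagging SDK.

// Sketch: an analytics event enriched with the current flag state, so
// that behavior can later be segmented by flag. `analytics.track` and
// `features.allFlagsFor` are illustrative stand-ins.
function onCheckoutButtonClicked(currentUser) {
  analytics.track("checkout_button_clicked", {
    userId: currentUser.id,
    featureFlags: features.allFlagsFor({ user: currentUser }),
    // e.g., { showReallyBigCheckoutButton: true, freeShipping: false }
  });
}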
Categories of Feedback
You might already have noticed from the discussion so far that a feature
change can have a broad variety of effects. We might observe an impact
on technical metrics like CPU usage, memory utilization, or request
latency. Alternatively, we might also be looking for changes in higher-
level business metrics like bounce rate, conversion rate, or average order
value.
Ideally, we will look across both categories when looking for the impact
of a feature change. One thing to note when looking to correlate a feature
change with its effects is that business analytics is typically oriented
around the user, at least for a typical B2C product, whereas technical
metrics usually focus on things like servers and processes. You’ll likely
want to use different contextual information for these different categories
of feedback.
Chapter 12. Summary
In this book, we’ve offered advanced users of feature flags some best
practices for working with feature flags. Following tips such as
maintaining flag consistency in different scenarios, development and
testing with feature flags, and working with feature flags at scale, will help
you to manage a growing feature-flag practice.