Controlled Experiments Tutorial (2012-06)
Concept is trivial
Randomly split traffic between two (or more) versions
A/Control
B/Treatment
Collect metrics of interest
Analyze
Must run statistical tests to confirm differences are not due to chance
Best scientific way to prove causality, i.e., the changes in metrics are caused by changes introduced in the treatment(s)
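A minimal sketch of the random split, assuming hash-based bucketing on a user ID (the function name, modulus, and 50/50 split below are illustrative, not the tutorial's code):

```python
import hashlib

def assign_variant(user_id: str, experiment: str, treatment_pct: float = 0.5) -> str:
    """Deterministically assign a user to Control or Treatment.

    Hashing (experiment, user_id) keeps the assignment stable across visits
    and independent across experiments. All names here are illustrative.
    """
    digest = hashlib.md5(f"{experiment}:{user_id}".encode()).hexdigest()
    bucket = (int(digest, 16) % 10000) / 10000.0   # uniform in [0, 1)
    return "Treatment" if bucket < treatment_pct else "Control"

print(assign_variant("user-12345", "new-search-box"))
```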
[Two designs shown side by side: A and B]
• Raise your right hand if you think A wins
• Raise your left hand if you think B wins
• Don’t raise your hand if you think they’re about the same
Differences: A has a taller search box (overall size is the same), a magnifying glass icon, and “popular searches”;
B has a big search button
Insight
Stop debating; it’s easier to get the data
Avoid the temptation to try and build optimal features
through extensive planning without early testing of ideas
Experiment often
To have a great idea, have a lot of them -- Thomas Edison
If you have to kiss a lot of frogs to find a prince,
find more frogs and kiss them faster and faster
-- Mike Moran, Do it Wrong Quickly
Try radical ideas. You may be surprised
Doubly true if it’s cheap to implement (e.g., shopping cart
recommendations and Behavior-Based search at Amazon)
If you're not prepared to be wrong, you'll never come up
with anything original – Sir Ken Robinson, TED 2006
If you remember one thing from this talk, remember this point
OEC = Overall Evaluation Criterion
Agree early on what you are optimizing
Getting agreement on the OEC in the org is a huge step forward
Suggestion: optimize for customer lifetime value, not immediate
short-term revenue
Criterion could be a weighted sum of factors (see the sketch below), such as
Time on site (per time period, say week or month)
Visit frequency
Report many other metrics for diagnostics, i.e., to
understand why the OEC changed and raise new hypotheses
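As an illustration only, an OEC defined as a weighted sum might be computed as below; the metric names and weights are hypothetical, not the tutorial's actual OEC:

```python
# Hypothetical weighted-sum OEC; metrics and weights are illustrative only.
OEC_WEIGHTS = {
    "sessions_per_user_per_week": 0.6,   # visit frequency
    "minutes_on_site_per_week": 0.4,     # time on site
}

def oec(metrics: dict) -> float:
    """Combine per-user metrics into a single Overall Evaluation Criterion."""
    return sum(weight * metrics[name] for name, weight in OEC_WEIGHTS.items())

print(oec({"sessions_per_user_per_week": 3.2, "minutes_on_site_per_week": 41.0}))
```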
Measurement
• Semmelweis worked at Vienna’s General Hospital, an
important teaching/research hospital, in the 1830s-40s
• In 19th-century Europe, childbed fever killed more than a
million women
• Measurement: the mortality rate for women giving birth was
• 15% in his ward, staffed by doctors and medical students
• 2% in the other ward at the hospital, attended by midwives
Control
• He tried to control all differences
• Birthing positions, ventilation, diet, even the way laundry was done
• He was away for 4 months, and the death rate fell significantly while he was away. Could it be related to him?
Insight
• Doctors were performing autopsies each morning on cadavers
• Conjecture: particles (called germs today) were being transmitted to healthy patients on the hands of the physicians
• He experimented with cleansing agents
• A chlorinated lime solution was effective: the death rate fell from 18% to 1%
[Diagram: Hubris → Measure and Control → Accept Results (avoid the Semmelweis Reflex) → Fundamental Understanding]
If you don't know where you are going, any road will take you there
—Lewis Carroll
Primacy/novelty effect
Primacy: Changing navigation in a website may degrade the customer
experience (temporarily), even if the new navigation is better
Novelty: a new flashing icon catches the eye and everyone clicks on it once
Evaluation may need to focus on new users, or run for a long period
Multiple experiments
Even though the methodology shields an experiment from other changes,
statistical variance increases, making it harder to get significant results.
There can also be strong interactions (rarer than most people think)
Consistency/contamination
On the web, assignment is usually cookie-based, but people may use
multiple computers, erase cookies, etc. Typically a small issue
Launch events / media announcements sometimes
preclude controlled experiments
The journalists need to be shown the “new” version
Ramp-up
Start an experiment at 0.1%
Do some simple analyses to make sure no egregious problems can be
detected
Ramp-up to a larger percentage, and repeat until 50%
Big differences are easy to detect because the minimum sample size is
inversely proportional to the square of the effect we want to detect
Detecting 10% difference requires a small sample and serious problems can
be detected during ramp-up
Detecting 0.1% requires a population 100^2 = 10,000 times bigger
Abort the experiment if treatment is significantly worse on OEC or
other key metrics (e.g., time to generate page)
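A minimal sketch of such a ramp-up loop; the percentages and the two callback hooks (run_at, is_significantly_worse) are stand-ins for real data collection and statistical checks:

```python
# Illustrative ramp-up schedule: 0.1% -> ... -> 50% of traffic in Treatment.
RAMP_STEPS = [0.001, 0.01, 0.05, 0.20, 0.50]

def ramp_up(run_at, is_significantly_worse):
    """Increase exposure step by step; abort if the Treatment is significantly
    worse on the OEC or a key guardrail metric (e.g., time to generate page)."""
    for pct in RAMP_STEPS:
        results = run_at(pct)                      # collect metrics at this exposure
        if is_significantly_worse(results):
            print(f"Aborting at {pct:.1%}: Treatment significantly worse")
            return False
    return True

# Stand-in hooks, for illustration only
ok = ramp_up(run_at=lambda pct: {"pct": pct},
             is_significantly_worse=lambda results: False)
print("reached 50%" if ok else "aborted")
```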
A: When the user clicks on an email, Hotmail opens in the same window
B: Hotmail opens in a separate window
Trigger: only users who click in the module are in the experiment (no difference otherwise)
OEC: clicks on the home page (after trigger)
The experiment report was sent by the BI/CI team to multiple teams across the world
Someone who saw the report wrote:
This report came along at a really good time and was VERY useful.
I argued this point to my team (open Live services in new window from
HP) just some days ago.
They all turned me down.
Proposal: New Offers module below Shopping
[Screenshots: Control and Treatment versions of the page]
Value proposition
The Offers module appears below the fold
Sales estimated the three ads would sell for several million dollars a year
Concern
Do more ads degrade the user experience?
How do we trade the two off?
Experiment!
Determine impact of 2 factors for video ads.
1) Factor A: pre-roll vs. post-roll ads
2) Factor B: time between ads (90, 120, 180, 300, 900 seconds)
OEC: revenue from ad starts
A: Solitaire game
in hero position
B: Poker game
in hero position
A is 61% better
For many years, the prevailing
conception of illness was that the sick
were contaminated by some toxin
Opening a vein and letting the sickness
run out – bloodletting.
One British medical text recommended bloodletting for
acne, asthma, cancer, cholera, coma, convulsions, diabetes, epilepsy, gangrene, gout, herpes,
indigestion, insanity, jaundice, leprosy, ophthalmia, plague, pneumonia, scurvy, smallpox, stroke,
tetanus, tuberculosis, and for some one hundred other diseases
Physicians often reported the simultaneous use of fifty or more leeches
on a given patient.
Through the 1830s the French imported about forty million leeches a
year for medical purposes
The lancet, the instrument used for bloodletting, gave its name to the medical journal The Lancet
Essentials (CONT)
3. Anticipate and exploit early information
a) Front-load to identify problems and provide guidance when it's cheap
b) Acknowledge the trade-off between cost and fidelity:
low-fidelity experiments (costing less) are suited to early exploratory stages
4. Combine new and traditional technologies
a) Today's new technology might eventually replace its traditional counterpart,
but it could then be challenged by tomorrow's new technology
Roger Longbotham,
Principal Statistician, Microsoft
What to measure
How to compare Treatment to Control
How long to run test
Start up options
Good test design
Data validation and cleansing
Before your first experiment
Common errors
MultiVariable Tests
Advanced Topics
Measures of user behavior
Number of events (clicks, pageviews, visits, downloads, etc)
Time (minutes per session, total time on site, time to load page)
Value (revenue, units purchased, ads clicked)
Analysis units
Per user (e.g. clicks per user)
Per session (e.g. minutes per session)
Per user-day (e.g. pageviews per day)
Per experiment (e.g. clicks per pageview)
Metric plotted is number of clicks in an hour divided by the number of pageviews
Single Treatment
Two-sample t-test works well
Large sample sizes => Normal distribution for means
Calculate a 95% Confidence Interval (CI) for the difference in the two means (a sketch follows below)
Included in the scorecard: averages for both variants, p-values, percent change, significance, and confidence intervals (103 metrics)
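A small sketch of the two-sample t-test and a 95% CI for the difference in means, on simulated per-user data (the numbers are made up; Welch's test from SciPy is used here as one reasonable choice):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
control = rng.normal(loc=10.0, scale=3.0, size=50_000)    # e.g., clicks per user
treatment = rng.normal(loc=10.1, scale=3.0, size=50_000)

# Welch's two-sample t-test on the difference in means
t_stat, p_value = stats.ttest_ind(treatment, control, equal_var=False)

diff = treatment.mean() - control.mean()
se = np.sqrt(treatment.var(ddof=1) / len(treatment) + control.var(ddof=1) / len(control))
ci = (diff - 1.96 * se, diff + 1.96 * se)    # 95% CI for the difference in means

print(f"delta={diff:.4f} ({diff / control.mean():+.2%}), p={p_value:.4f}, 95% CI={ci}")
```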
May need to consider
Sample size needed for normality of means
n = 16 · r · σ² / Δ²
Power depends on
• The size of the effect you want to be able to detect, Δ
• Variability of the metric
• Number of users in each group (T/C)
Example: The number of users needed for each variant (group) to achieve 80% power,
with equal numbers of users in Treatment and Control and with standard deviation s, is
N = 32 · s² / Δ²
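A one-line application of this rule of thumb, with hypothetical values for s and Δ:

```python
def users_per_variant(std_dev: float, delta: float) -> int:
    """N = 32 * s^2 / delta^2: users needed in each of Treatment and Control
    for ~80% power to detect an absolute difference of `delta`."""
    return int(32 * std_dev**2 / delta**2)

# Hypothetical metric with s = 2.0, detecting an absolute change of 0.05
print(users_per_variant(std_dev=2.0, delta=0.05))   # 51,200 users per variant
```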
Simpson's paradox can happen when numerators and denominators are accumulated
over groups of unequal sizes
Famous example: the UC Berkeley gender bias lawsuit. The graduate school was sued in 1973. Admission rates:
         Applicants   Admitted
Men      8442         44%
Women    4321         35%
So a larger percentage of men than women was admitted overall
In these two departments, admission rates did not seem to favor men, but when
combined, the admission rate for men was 55% and for women it was 41% (see the sketch below)
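To show how the reversal can arise, here is a hypothetical two-department example (the numbers are invented for illustration and are not the actual Berkeley department data):

```python
# Hypothetical data: within each department women are admitted at a higher
# rate, yet the combined rate favors men because of the unequal applicant mix.
depts = {
    #          men (applied, admitted)   women (applied, admitted)
    "Dept A": ((800, 480),               (100, 70)),
    "Dept B": ((200, 40),                (700, 175)),
}

def rate(applied_admitted):
    applied, admitted = applied_admitted
    return admitted / applied

men_total = [sum(x) for x in zip(*(d[0] for d in depts.values()))]
women_total = [sum(x) for x in zip(*(d[1] for d in depts.values()))]

for name, (men, women) in depts.items():
    print(f"{name}: men {rate(men):.0%}, women {rate(women):.0%}")
print(f"Combined: men {rate(men_total):.0%}, women {rate(women_total):.0%}")
```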
[Chart: daily percentages for Control and Treatments T1-T5, and percent of traffic in Treatment, over days 1-35]
The widget that performed the best was the simplest
Note: Ronny's example earlier compared the best Treatment to another Treatment, not to the Control
Triggering
Blocking
Measuring non-test factors
Randomization
Content
Ex: If the content of a site changes during the experiment, it must be the
same for both Treatment and Control at all times
User
May use before/after for users or a cohort group
Updates to site
The Treatment and Control groups should be as alike as
possible except for application of the treatment
Who is in the experiment
What is done during the experiment
etc.
[Chart: clickthrough rate for Control (CTR_Control) and Treatment (CTR_Tmt) by hour, 10/15/07 through 10/19/07]
Why randomize?
Remove robots (web crawlers, spiders, etc.) from analysis
They can generate many pageviews or clicks in Treatment or
Control, skewing the results
Remove robots with known identifiers (found in the user agent)
Develop heuristics to identify robots with many clicks or
pageviews in short period of time
Other patterns may be used to identify robots as well, such as
very regular activity
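A sketch of such heuristics; the user-agent patterns and the clicks-per-minute threshold are assumptions for illustration, not production robot filters:

```python
import re

BOT_UA = re.compile(r"bot|crawler|spider|slurp", re.IGNORECASE)   # known identifiers
MAX_CLICKS_PER_MINUTE = 30   # hypothetical threshold for inhumanly fast clicking

def is_robot(user_agent: str, clicks: int, active_minutes: float) -> bool:
    if BOT_UA.search(user_agent):
        return True                      # known identifier in the user agent
    if active_minutes > 0 and clicks / active_minutes > MAX_CLICKS_PER_MINUTE:
        return True                      # too many clicks in too short a period
    return False

print(is_robot("Googlebot/2.1", clicks=10, active_minutes=1))    # True
print(is_robot("Mozilla/5.0", clicks=2000, active_minutes=5))    # True
print(is_robot("Mozilla/5.0", clicks=12, active_minutes=5))      # False
```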
Not conducting logging or A/A tests
A/A tests find caching issues and UID reassignment
Not keeping all factors constant or blocking
Content changes to site
Redirect for Treatment but not for Control
Sample size too small
Caution with using surrogates for OEC!
Measure clicks to buy button (instead of revenue)
Clicks to download button (instead of completed downloads)
Factors/variables
(This is for illustration purposes only, it does not reflect any previous or planned test on MSN HP)
Advantages:
– Can test many things at once, accelerating innovation
– Can estimate interactions between factors
Disadvantages:
– Some combinations of factors may give negative customer
experience
– Analysis and interpretation is more difficult
– May take longer to set up test
Procedure for analyzing an MVT for interactions
1. Since there are potentially a very large number of interactions
among the variables being tested, restrict the ones you will
look at to a few you suspect may be present. (With 7 factors there are 21
two-factor interactions, 35 three-factor interactions, etc.)
2. Conduct the test to determine whether the interaction between two
factors is present or not (a minimal sketch follows below)
3. If the interaction is not significant, stop!
If the interaction IS significant, look at the graphical output to
interpret it.
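A minimal sketch of step 2 on simulated data, using an ordinary least squares model with an interaction term (the factor names, effect sizes, and noise level are made up):

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(1)
n = 20_000
df = pd.DataFrame({
    "A": rng.integers(0, 2, n),   # factor A: 0 = control level, 1 = treatment level
    "B": rng.integers(0, 2, n),   # factor B: 0 = control level, 1 = treatment level
})
# Simulated response with main effects and a small A x B interaction
df["clicks"] = 4.05 + 0.02 * df.A + 0.03 * df.B + 0.04 * df.A * df.B + rng.normal(0, 1, n)

# OLS with an interaction term; the p-value for A:B tests the interaction
model = smf.ols("clicks ~ A * B", data=df).fit()
print(model.pvalues[["A", "B", "A:B"]])
```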
If hypothesis test for interaction is not significant
Assume no interaction present
Interaction graph would show lines approximately parallel
If interaction is statistically significant
Plot interaction to interpret
[Interaction plot for F2 x F3: lines for F3 - C and F3 - T across F2 - C and F2 - T are approximately parallel]
Case 2: Synergistic Interaction
[Data table, main effects results, and F2xF3 interaction plot of the average number of clicks (MSN HP): the lines for F3 - C and F3 - T across F2 - C and F2 - T are not parallel, indicating a synergistic interaction]
For metrics that are not “per user” (i.e., not the same as the
randomization unit), the usual standard deviation formula cannot be used
Can use bootstrap or delta method to estimate variance
Delta method uses a formula to take into account correlation of
experimental units
Example: Clickthrough rate (CTR) per experiment
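A sketch of the delta-method variance for a ratio metric such as CTR per experiment, assuming the user is the randomization unit (data simulated for illustration):

```python
import numpy as np

def delta_method_var(clicks: np.ndarray, pageviews: np.ndarray) -> float:
    """Approximate variance of CTR = sum(clicks) / sum(pageviews) when users
    are the randomization unit, via a first-order Taylor expansion of
    mean(clicks) / mean(pageviews)."""
    n = len(clicks)
    mu_x, mu_y = clicks.mean(), pageviews.mean()
    var_x, var_y = clicks.var(ddof=1), pageviews.var(ddof=1)
    cov_xy = np.cov(clicks, pageviews, ddof=1)[0, 1]
    return (var_x / mu_y**2
            - 2 * mu_x * cov_xy / mu_y**3
            + mu_x**2 * var_y / mu_y**4) / n

rng = np.random.default_rng(2)
pageviews = rng.poisson(10, 100_000) + 1          # pageviews per user (hypothetical)
clicks = rng.binomial(pageviews, 0.05)            # clicks per user (hypothetical)
ctr = clicks.sum() / pageviews.sum()
print(ctr, np.sqrt(delta_method_var(clicks, pageviews)))   # CTR and its std error
```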
A population segment is interesting if its response to the
Treatment is different from the overall response (see the sketch after this list)
Segments can be defined by a number of variables
Browser or operating system
Referrer (e.g. from search engine, etc.)
Signed-in status
Loyalty
Demographics
Location – country, state, size of city (use IP lookup)
Bandwidth
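A small sketch of a per-segment comparison (the data and the "browser" segment are hypothetical; any of the variables above could be used):

```python
import pandas as pd

# Hypothetical per-user data: variant assignment, a segment, and the metric
df = pd.DataFrame({
    "variant": ["C", "T", "C", "T", "C", "T"],
    "browser": ["IE", "IE", "Firefox", "Firefox", "Chrome", "Chrome"],
    "clicks":  [4.0, 4.1, 3.8, 4.3, 4.2, 4.2],
})

# Treatment-vs-Control lift per segment; a segment is interesting when its
# lift clearly differs from the overall lift (confirm with a statistical test)
by_segment = df.pivot_table(index="browser", columns="variant", values="clicks")
by_segment["lift"] = by_segment["T"] / by_segment["C"] - 1
print(by_segment)
```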
BREAK
Ronny Kohavi, Microsoft
Why?
[Chart: daily results, 8/31/2011 through 9/4/2011; annotation: "Less negative on day 3: -0.21%"]
[Chart (partially recovered): "... of falling outside the 95% CI", x-axis 0 to 20]