6/30/24, 12:05 PM Using a Multi-Armed Bandit with Thompson Sampling to Identify Responsive Dashers

Using a Multi-Armed Bandit with Thompson Sampling to Identify
Responsive Dashers
March 15, 2022 · 10 Minute Read · Machine Learning

Arjun Sharma

Maintaining Dasher supply to meet consumer demand is one of the
most important problems for DoorDash to resolve in order to offer timely
deliveries. When too few Dashers are on the road to fulfill orders, we take
reactive actions to persuade more Dashers to begin making deliveries.
One of the most effective things we can do is to message Dashers that
there are a lot of orders in their location and that they should sign on to
start dashing. Dashing during peak hours can mean a more productive
shift, higher earnings, and more flexibility in choosing which offers to
accept.

We need to optimize which Dashers to target with our messages
because approaching Dashers with no interest in dashing at that time
can create a bad user experience. Here we will describe a bandit-like
framework to dynamically learn and rank the preferences of Dashers
when we send out messages so that we can optimize our decisions
about who to message at a given time.
https://doordash.engineering/2022/03/15/using-a-multi-armed-bandit-with-thompson-sampling-to-identify-responsive-dashers/ 1/11

Finding the best way to alert Dashers about low supply
Currently we select Dashers to message by identifying who has been
active in a given location and then selecting recipients at random. While
this approach doesn’t overload specific Dashers with messages, it
doesn’t improve the conversion rate of Dashers coming onto the
platform after receiving a push notification.

We need to find a methodology that uses our information about Dasher
preferences while avoiding spamming Dashers who wouldn’t be
interested in receiving notifications at that time. This problem statement
lends itself to finding a machine learning approach that can:

• Identify current responsive Dashers who are more likely to convert
when asked to dash now
• Identify Dashers who aren’t interested in these messages so we can
avoid spamming them
• Identify new responsive Dashers so that we don’t overtax our existing
responsive Dashers
• Rank Dashers by their willingness to dash when contacted so we know
how to prioritize who to message at each send

ML approaches we considered
One possible approach is to treat this as a supervised learning
classification problem. We can use labeled past data (for example,
whether a Dasher historically has signed on to dash when invited) and
try to create a model that predicts a Dasher’s probability of dashing
now when sent a message, given a set of features.


While this approach is easy to frame as a binary classification model,
it has some issues. What if Dasher preferences change over time? For
example, a Dasher who is enrolled in college could be very responsive
during breaks, but largely unavailable once school resumes. This type of
non-stationary behavior would have to be handled in model training
through frequent retraining and by weighting recent observations more
heavily.

Another problem with this approach is that it only optimizes for the
probability of dashing when a message is sent. With this approach, we
would only be sending messages to Dashers we already know are likely
to convert. There would be no basis to send messages to other Dashers,
giving them a chance to self-identify as responsive Dashers.

Because of our constraints and what we are optimizing for, there are
multiple benefits to using a bandit algorithm instead of supervised
learning. We can construct a bandit-like procedure that allows us to
dynamically explore Dashers to message, over time identifying and
optimizing on Dashers who respond to messages. This approach would
allow us to dynamically allocate traffic to Dashers who are more
responsive.

As Dasher preferences change over time, the algorithm can relearn
dynamically which Dashers would be most likely to convert. We can even
easily extend this framework to use a contextual bandit; if Dasher
preferences change based on time of day or day of week, the bandit can
be given these features as context to make more accurate decisions.

Next, we need to select which bandit framework to use in order to
allocate traffic to Dashers dynamically.

A trio of possible bandits


There are multiple factors involved in determining which bandit to use.
We want an algorithm that explores enough to adjust to changing
Dasher preferences and yet still sends messages to Dashers who we
already know are responsive. Several algorithms come to mind as
possible choices:

The Epsilon-Greedy algorithm defines a parameter – epsilon – that
determines how much to explore sending messages to Dashers about
whom we don’t know as much.

• Pros:
  • Easy to understand and implement
  • Makes it easier to prioritize known Dashers based on their likelihood
    to respond to messages
• Cons:
  • Because we have to define this constant epsilon percentage, it does
    not improve over time. We can explore too little early on and too
    much later in the process
  • Experimentation is not dynamic; no matter what we have learned
    about Dashers’ preferences, we are always exploring at a fixed
    percentage
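
To make the fixed exploration rate concrete, here is a minimal sketch of an epsilon-greedy pick. The function name, the `conversion_rates` dictionary, and the Dasher ids are hypothetical and used only for illustration; this is not the code DoorDash ran.

```python
import random


def epsilon_greedy_pick(conversion_rates, epsilon, rng=random):
    """Pick one Dasher id: explore with probability epsilon, else exploit.

    conversion_rates is an assumed dict of dasher_id -> observed conversion rate.
    """
    dasher_ids = list(conversion_rates)
    if rng.random() < epsilon:
        # Explore: choose uniformly at random, ignoring what we have learned.
        return rng.choice(dasher_ids)
    # Exploit: choose the Dasher with the best observed conversion rate.
    return max(dasher_ids, key=lambda d: conversion_rates[d])
```

Because `epsilon` is a constant, the share of exploratory messages stays the same no matter how much has been learned, which is the drawback listed above.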

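The Upper Confidence Bound (UCB) bandit algorithm is based on the
principle of “optimism in the face of uncertainty,” which translates to
selecting the action with the highest upper confidence bound on its
reward, that is, the estimated reward plus a bonus that grows with our
uncertainty about that action.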

• Pros:
  • Finds the best-performing Dashers quickly
  • Once there’s enough data, starts to optimize sending messages to
    responsive Dashers instead of exploring
• Cons:
  • Difficult to communicate the strategy to stakeholders about why a
    specific action was taken
  • When there is an excess of new Dashers, this method could end up
    only messaging new Dashers until enough signal is received
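
As an illustration of how the uncertainty bonus behaves, the sketch below uses the standard UCB1 score. The post does not say which UCB variant was considered, so the formula, the function name, and the exploration constant `c` are assumptions.

```python
import math


def ucb1_score(successes, sends, total_sends, c=2.0):
    """UCB1-style score: observed conversion rate plus an uncertainty bonus."""
    if sends == 0:
        # A Dasher we have never messaged gets an unbounded score,
        # so it is always chosen before any Dasher with history.
        return math.inf
    mean = successes / sends
    # The bonus shrinks as this Dasher accumulates more sends.
    bonus = math.sqrt(c * math.log(total_sends) / sends)
    return mean + bonus
```

The infinite score for never-messaged Dashers is what produces the last drawback above: with many new Dashers, they crowd out everyone else until each has been tried.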

Thompson Sampling takes a Bayesian approach to the problem. We
assign a prior probability distribution to each Dasher and update it to a
posterior distribution as we observe how that Dasher responds.

• Pros:
  • Intuitive approach that counts the successes and failures of each
    message sent to a Dasher
  • Depending on the probability distribution used, we can take
    advantage of the conjugate relationship between prior and
    posterior probabilities and use a simple update rule to get the
    posterior probability
  • Easy to implement
  • Finds best-performing Dashers quickly
• Cons:
  • Requires manually setting priors for new Dashers; an approach like
    UCB always includes Dashers we have not previously messaged

Why we chose Thompson Sampling


Given these three frameworks, we selected Thompson Sampling for its
intuitive approach and ease of implementation.

We started by defining our target function: the probability that a
Dasher who receives a message will convert and sign on to DoorDash
immediately. After this, we needed to choose a prior for each Dasher
from which we could sample to decide who to message. A prior is a
probability distribution that models the probability that a given Dasher
will respond when messaged. Along with choosing an appropriate prior,
we also needed a method for updating it given new information. We used
a beta distribution because its two parameters map directly onto the
number of successes (alpha) and the number of failures (beta). Because
the beta distribution is conjugate to the Bernoulli likelihood of a
convert-or-not outcome, the posterior is also a beta distribution, which
gives an intuitive update rule: add to alpha if a Dasher converts or, if
not, add to beta. As we update the distribution following each message,
its variance generally shrinks as we become more certain of the
Dasher’s conversion probability.
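
The update rule can be sketched in a few lines of Python. The class name `DasherBelief` and the uniform Beta(1, 1) starting point are assumptions made for illustration; the sketch uses only the standard library's `random.betavariate`.

```python
import random
from dataclasses import dataclass


@dataclass
class DasherBelief:
    """Beta distribution over one Dasher's probability of converting."""

    alpha: float = 1.0  # pseudo-count of conversions (assumed uniform start)
    beta: float = 1.0   # pseudo-count of non-conversions

    def update(self, converted: bool) -> None:
        # Conjugate update: a conversion adds to alpha, otherwise add to beta.
        if converted:
            self.alpha += 1
        else:
            self.beta += 1

    def sample(self, rng=random) -> float:
        # Thompson Sampling draws one plausible conversion rate per decision.
        return rng.betavariate(self.alpha, self.beta)

    def mean(self) -> float:
        return self.alpha / (self.alpha + self.beta)

    def variance(self) -> float:
        total = self.alpha + self.beta
        return self.alpha * self.beta / (total ** 2 * (total + 1))
```

After three conversions and seven non-conversions, for example, the belief moves from Beta(1, 1) to Beta(4, 8), and its variance is smaller than where it started.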

Our last decision when defining the prior was whether to start at pure
exploration (a uniform distribution) or use past data to inform our prior.
We chose to inform each Dasher’s prior with previous messages and
conversion data to speed up the convergence of the distributions. We
apply a weight-decay parameter on previous observations to favor
recent data over historical observations. This way, when we start the
experiment, the bandit has a head start on Dasher preferences without
biasing too heavily toward old, and potentially stale, data.
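
One way to build such an informed prior is sketched below. The post does not specify the form of the weight decay, so the exponential per-day decay, the `history` format, and all parameter names here are assumptions.

```python
def informed_prior(history, decay=0.9, base_alpha=1.0, base_beta=1.0):
    """Build Beta parameters from past messages, down-weighting older ones.

    history is an assumed list of (age_in_days, converted) pairs.
    """
    alpha, beta = base_alpha, base_beta
    for age_days, converted in history:
        # An observation from age_days ago counts as decay ** age_days of a message.
        weight = decay ** age_days
        if converted:
            alpha += weight
        else:
            beta += weight
    return alpha, beta
```

With `decay=0.5`, a conversion today and a non-conversion yesterday would turn a Beta(1, 1) starting point into Beta(2.0, 1.5): yesterday's outcome counts for half as much as today's.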

Next, we needed to tune a set of hyperparameters vital to modeling the
situation accurately. Among the questions we considered were:

• How long should each observation be? That is, what time period
should each observation cover? If it’s too short, we can’t
accumulate enough reward/penalty for each run. If too long, it takes
extra time to update the algorithm to find high-performing Dashers.
• How stationary is the problem? Dasher behavior changes over time, so
we must give greater weight to recent observations than those
recorded in the past. If a previously responsive Dasher ceases to
respond, we need to update our probability distribution quickly.
• What prior should we give new Dashers? It’s important to add new
Dashers to the algorithm without degrading our performance while
still giving them a chance to be selected so that we can learn if they
are a high-performing Dasher.
• Given that there’s an imbalance in the data (a majority of Dashers
choose not to dash when messaged), how much weight should we give
success vs. failure?
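
Two of these questions, non-stationarity and the success/failure imbalance, can be expressed as knobs on the update rule. The sketch below is one possible parameterization, not the one described in the post: `success_weight`, `failure_weight`, `forget`, and `min_count` are all hypothetical names and values.

```python
def weighted_update(alpha, beta, converted,
                    success_weight=1.0, failure_weight=1.0,
                    forget=1.0, min_count=1e-3):
    """Return updated Beta parameters after one message outcome.

    forget < 1 shrinks old evidence so recent behavior dominates;
    the two weights let successes and failures count unequally.
    """
    # Discount existing evidence, keeping both parameters strictly positive.
    alpha = max(alpha * forget, min_count)
    beta = max(beta * forget, min_count)
    if converted:
        alpha += success_weight
    else:
        beta += failure_weight
    return alpha, beta
```

With `forget=1.0` and both weights at 1.0 this reduces to the plain add-to-alpha or add-to-beta rule; lowering `forget` makes a previously responsive Dasher's distribution move faster once they stop responding.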

After defining our beta distribution, update rule, and these
hyperparameters, we are ready to use the bandit procedure to decide
which Dashers to message. In our experiment, whenever we are ready to
send out a message, we let the bandit draw one sample from each
Dasher’s current distribution to give us a sampled probability of
converting when messaged. We then rank the Dashers in descending order
by their sampled value and take the top Dashers whose sampled value is
greater than a predetermined threshold so that we don’t message Dashers
who the bandit has determined are unlikely to convert. We define the
number of Dashers to contact by first determining how many are needed
to resolve the current shortage. We then divide that number by the
average conversion rate for Dashers in that location. The bandit can
then message the Dashers who it has determined are most likely to get
on the road.
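
The selection step described above can be sketched as follows. The function and argument names are hypothetical, and rounding the message count up with `math.ceil` is an assumption; the post only says the shortage is divided by the average conversion rate.

```python
import math
import random


def dashers_to_message(beliefs, shortage, avg_conversion_rate, threshold, rng=random):
    """Sample, rank, threshold, and truncate to the number of messages needed.

    beliefs is an assumed dict of dasher_id -> (alpha, beta).
    """
    # Messages needed = Dashers needed / average conversion rate in the location.
    n_to_message = math.ceil(shortage / avg_conversion_rate)
    # One Thompson sample per Dasher from their current beta distribution.
    sampled = {d: rng.betavariate(a, b) for d, (a, b) in beliefs.items()}
    # Rank by sampled value, highest first.
    ranked = sorted(sampled, key=sampled.get, reverse=True)
    # Skip Dashers whose sampled conversion probability is at or below the threshold.
    eligible = [d for d in ranked if sampled[d] > threshold]
    return eligible[:n_to_message]
```

For instance, a shortage of 5 Dashers in a location with an average conversion rate of 0.25 gives 20 messages to send, and a Dasher with a long record of ignoring messages will almost never sample above the threshold.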

Results
Currently, we are running experiments to test this bandit framework
against our previous random sampling method. We are using a
switchback experiment to measure the impact that improved message
targeting has on the overall supply/demand balance for a given location.
Using this testing framework, we not only see if there is an increase in
Dashers who respond to messages, but we can also see what effect
these additional Dashers have on the market supply. So far, we have seen
an improvement in the conversion rate of messages sent in the bandit
framework, which has allowed us to send fewer messages than required
by our control variant. We are experimenting further to prove the
impact.

Conclusion
While we have tailored Thompson Sampling to a specific Dasher
scenario, this solution can work in many different scenarios. Companies
seeking to provide a personalized experience to all of their customers
may have limited data to figure out how to best accomplish that.
Thompson Sampling can help identify which options give the
greatest reward in a non-stationary environment. The method works well
in a quickly changing business environment where there’s a need to
dynamically optimize traffic. With a single model, we get the advantages
of velocity, dynamic traffic allocation, and a solution that handles
changing behavior over time.

While what we have done to date works well, there are many ways we
can improve upon this approach. Currently, we only consider whether a
Dasher signed on after receiving a message. But additional data lets us
know that Dashers’ preferences change based on their location, time of
day, day of week, and much more. Over time, we can encode this
information as contextual features so that the bandit can make even
smarter decisions.

Acknowledgements
This post is based in large part on the great work of our intern Hamlin Liu.
We are excited to have him join us full time in August!
