
MAB demo, Thompson Sampling, RL sell like a wolf, Q-learning

Submitted to:
Prof. M. P. Sebastian

Submission by:
Piyush S Sonawane
PGP/25/115
To understand the reward-based mechanism of an ML system, we studied the Multi-Armed Bandit (MAB) problem and how models are built to solve it. We also learned about the Thompson sampling method for choosing the slot machine that generates the best reward. A similar approach was then demonstrated for warehouse robots, sales, and advertising.

MAB Problem

In Step 3, the values for the "bandits" (slot machines) were established, and a programme distributed a random prize with each lever pull.
The class that would simulate the game was then defined. Slots A, B, and C were assigned Gaussian Bandits 1, 2, and 3 respectively. The game was then played by choosing a machine through input values between 1 and 3, which rewarded us with prizes.
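The game above can be sketched as follows. The means and standard deviations of the three Gaussian bandits are hypothetical stand-ins, since the actual values used in the demo are not given.

```python
import random

class GaussianBandit:
    """A slot machine whose reward is drawn from a normal distribution."""
    def __init__(self, mean, stdev):
        self.mean = mean
        self.stdev = stdev

    def pull(self):
        return random.gauss(self.mean, self.stdev)

# Hypothetical slot values (mean, stdev) for Slots A, B, C.
slots = {1: GaussianBandit(5, 2),   # Slot A -> Gaussian Bandit 1
         2: GaussianBandit(6, 2),   # Slot B -> Gaussian Bandit 2
         3: GaussianBandit(1, 5)}   # Slot C -> Gaussian Bandit 3

random.seed(0)
choice = 2                          # the player enters a value between 1 and 3
reward = slots[choice].pull()
print(round(reward, 3))
```

Each input between 1 and 3 pulls the corresponding lever and returns a noisy prize, so repeated play is needed to learn which slot pays best.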

If a business wants to determine which banner advertisements perform best for it by examining the click-through rate (CTR), it can do so through A/B testing, using the same technique of exploring and exploiting. After the Bernoulli Bandit's values were defined, A/B testing could be conducted.

The programme determined that the best-performing advertisement was E.


A/B/n testing was then conducted over 100,000 trials, comparing the rewards by ad index. The average reward was calculated and plotted, coming out to 0.0296.
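A minimal sketch of the A/B/n test follows, assuming hypothetical Bernoulli CTRs for ads A to E (the actual rates used in the demo are not given). Each trial picks an ad uniformly at random and records whether it was clicked.

```python
import random

class BernoulliBandit:
    """An ad that yields a click (reward 1) with a fixed probability."""
    def __init__(self, p):
        self.p = p

    def pull(self):
        return 1 if random.random() < self.p else 0

random.seed(42)
# Hypothetical CTRs for ads A-E; ad E (index 4) is best by construction.
ads = [BernoulliBandit(p) for p in [0.004, 0.016, 0.02, 0.028, 0.031]]

n_trials = 100_000
counts = [0] * len(ads)   # how often each ad was shown
total = 0                 # total clicks across all trials
for _ in range(n_trials):
    ad = random.randrange(len(ads))   # pure exploration: pick uniformly
    counts[ad] += 1
    total += ads[ad].pull()

avg_reward = total / n_trials
print(avg_reward)
```

Because A/B/n testing explores uniformly and never exploits, its average reward sits near the mean of all the CTRs rather than near the best one.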

Epsilon-greedy works similarly, but it exploits the best-known action with probability 1 − epsilon and explores with probability epsilon, where epsilon is set to a value between 0 and 1. For the previously provided values, the best-performing advertisement was again correctly identified as E, and the average action reward for epsilon-greedy came out to 0.0304, which was a good value.
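The epsilon-greedy loop can be sketched as below, again with hypothetical CTRs. With probability epsilon it shows a random ad; otherwise it shows the ad with the best running mean reward, which is updated incrementally.

```python
import random

random.seed(7)
probs = [0.004, 0.016, 0.02, 0.028, 0.031]   # hypothetical CTRs; E (index 4) best
epsilon = 0.1
n_trials = 100_000

counts = [0] * len(probs)     # pulls per ad
values = [0.0] * len(probs)   # running mean reward per ad
total = 0

for _ in range(n_trials):
    if random.random() < epsilon:
        a = random.randrange(len(probs))                      # explore
    else:
        a = max(range(len(probs)), key=lambda i: values[i])   # exploit
    r = 1 if random.random() < probs[a] else 0
    counts[a] += 1
    values[a] += (r - values[a]) / counts[a]                  # incremental mean
    total += r

best_ad = max(range(len(probs)), key=lambda i: values[i])
print(best_ad, total / n_trials)
```

Since most pulls go to the current best estimate, the average reward lands closer to the best CTR than A/B/n testing's does.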

A comparable method for handling the exploration-exploitation trade-off is Upper Confidence Bounds (UCB). To determine the average rewards, the c value was evaluated at 0.1, 1 and 10. Again, the programme identified E as the best ad, with a best average reward of 0.0297.
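A sketch of UCB follows, shown here for a single c value (0.1; the demo also tried 1 and 10) and the same hypothetical CTRs. Each round plays the ad with the highest optimistic bound: its mean reward plus a confidence bonus that shrinks as the ad is pulled more often.

```python
import math
import random

random.seed(1)
probs = [0.004, 0.016, 0.02, 0.028, 0.031]   # hypothetical CTRs; E (index 4) best
c = 0.1                                       # exploration coefficient
n_trials = 100_000

counts = [0] * len(probs)
values = [0.0] * len(probs)
total = 0

for t in range(1, n_trials + 1):
    # Untried ads get an infinite bound so each is sampled at least once.
    ucb = [values[i] + c * math.sqrt(math.log(t) / counts[i])
           if counts[i] > 0 else float("inf")
           for i in range(len(probs))]
    a = ucb.index(max(ucb))
    r = 1 if random.random() < probs[a] else 0
    counts[a] += 1
    values[a] += (r - values[a]) / counts[a]
    total += r

best_ad = values.index(max(values))
print(best_ad, total / n_trials)
```

Larger c values widen the confidence bonus and force more exploration, which is why the demo swept c over 0.1, 1 and 10.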
Robots in Warehouse – Logistics

The robots were programmed to move between twelve locations labelled A to L along a predetermined course, with action numbers ranging from 0 to 11. The rewards array was established, and a value of 1000 was assigned to the goal location to steer the path. The programme will now reliably take the path leading to the 1000-reward state.

The shortest path from E to G was then found and printed.

The same challenge of choosing the optimal route was then solved with an intermediate target: K had to be inserted into the programme's path from E to G. The optimal route was determined to be E-I-J-K-L-H-G.
The procedure was then repeated without any intermediary and without obstructing the route's flow, and the optimal paths from E to G and from E to D were printed.
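The routing demo above can be sketched with tabular Q-learning. The corridor layout, discount factor, learning rate and episode count below are assumptions matching a common 3x4 warehouse demo, since the report does not reproduce the actual rewards matrix.

```python
import numpy as np

np.random.seed(0)
locations = list("ABCDEFGHIJKL")   # 12 warehouse locations

# Allowed moves (assumed corridor layout); 1 in the rewards matrix per move.
edges = [("A", "B"), ("B", "C"), ("B", "F"), ("C", "G"), ("D", "H"),
         ("E", "I"), ("F", "J"), ("G", "H"), ("H", "L"), ("I", "J"),
         ("J", "K"), ("K", "L")]
n = len(locations)
idx = {c: i for i, c in enumerate(locations)}
R = np.zeros((n, n))
for a, b in edges:
    R[idx[a], idx[b]] = R[idx[b], idx[a]] = 1

def route(start, goal, gamma=0.75, alpha=0.9, episodes=3000):
    """Q-learning: assign 1000 to the goal state, then follow max-Q moves."""
    Rg = R.copy()
    Rg[idx[goal], idx[goal]] = 1000        # the 1000-reward path to steer toward
    Q = np.zeros((n, n))
    for _ in range(episodes):
        s = np.random.randint(n)           # random starting state
        actions = np.flatnonzero(Rg[s] > 0)
        a = np.random.choice(actions)      # random allowed action
        # Bellman update toward reward plus discounted best next value.
        Q[s, a] += alpha * (Rg[s, a] + gamma * Q[a].max() - Q[s, a])
    path, s = [start], idx[start]
    while locations[s] != goal and len(path) < n:
        s = int(Q[s].argmax())             # greedily follow the learned Q-table
        path.append(locations[s])
    return path

print(route("E", "G"))
```

An intermediate stop such as K can be handled by concatenating two runs, e.g. `route("E", "K") + route("K", "G")[1:]`.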

Thompson Sampling for slot machines

Thompson sampling was then applied to the explore-and-exploit problem once more. Ten thousand samples were defined, and five conversion rates were specified with varying values. A dataset was then built as an array, with separate arrays designed to track wins and losses. The maximum draws from the Beta distributions were tallied. The per-machine totals were displayed as an array, and slot machine 4 was determined to be the best, offering the highest payouts.
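The Thompson sampling loop can be sketched as follows. The five conversion rates are hypothetical (chosen so that slot machine 4, at index 3, is best), since the actual values used in the demo are not given.

```python
import random

random.seed(3)
# Hypothetical conversion rates; slot machine 4 (index 3) is best by construction.
rates = [0.05, 0.13, 0.09, 0.16, 0.11]
n_samples = 10_000

wins = [0] * len(rates)      # successes per machine (Beta alpha - 1)
losses = [0] * len(rates)    # failures per machine (Beta beta - 1)

for _ in range(n_samples):
    # Draw one sample from each machine's Beta posterior and play the argmax.
    draws = [random.betavariate(wins[i] + 1, losses[i] + 1)
             for i in range(len(rates))]
    m = draws.index(max(draws))
    if random.random() < rates[m]:
        wins[m] += 1
    else:
        losses[m] += 1

best = wins.index(max(wins))
print(best, wins)
```

As the win/loss counts accumulate, the Beta posteriors sharpen and the sampler concentrates its pulls on the machine with the highest conversion rate.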
AI for Sales and Advertising

With 10,000 samples and nine conversion rates, a simulation matrix was created for the various sales and advertising strategies. The most lucrative strategy was then chosen with the aid of both Thompson sampling and random selection. The relative return of Thompson sampling over random selection was calculated to be 91%, and the corresponding graph was plotted. Strategy 6 was selected by the programme more than 8,000 times out of 10,000 rounds, showing that it was clearly more profitable than any other strategy. Strategy 6 is therefore most likely the best approach to use for sales and promotion.
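The comparison can be sketched as below: a pre-simulated matrix records, for each round, which of the nine strategies would have converted, and Thompson sampling is then measured against uniform random selection on the same matrix. The nine conversion rates are hypothetical stand-ins.

```python
import random

random.seed(11)
# Hypothetical conversion rates; strategy 6 (index 5) is best by construction.
rates = [0.05, 0.13, 0.09, 0.16, 0.11, 0.20, 0.14, 0.08, 0.12]
N = 10_000

# Simulation matrix: X[n][s] = 1 if strategy s would convert on round n.
X = [[1 if random.random() < p else 0 for p in rates] for _ in range(N)]

def run_thompson(X):
    wins = [0] * 9
    losses = [0] * 9
    selected = [0] * 9    # how often each strategy was chosen
    reward = 0
    for row in X:
        draws = [random.betavariate(wins[i] + 1, losses[i] + 1)
                 for i in range(9)]
        s = draws.index(max(draws))
        selected[s] += 1
        r = row[s]
        reward += r
        if r:
            wins[s] += 1
        else:
            losses[s] += 1
    return reward, selected

def run_random(X):
    # Baseline: pick a strategy uniformly at random each round.
    return sum(row[random.randrange(9)] for row in X)

ts_reward, selected = run_thompson(X)
rand_reward = run_random(X)
relative_return = (ts_reward - rand_reward) / rand_reward * 100
print(ts_reward, rand_reward, round(relative_return))
```

The `selected` counts show how heavily Thompson sampling concentrates on the best strategy once its posterior dominates, which is what the demo's selection histogram illustrated.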
