Machine Learning: Reinforcement Learning - Solving the Multi-Armed Bandit Problem with Thompson Sampling (Part 24)
Let's start with the multi-armed bandit problem:
we want to know which machine will give us the biggest payout.
We have already solved this using the UCB method here.
Now let's solve it using Thompson Sampling.
So, the plan: the problem statement (the multi-armed bandit problem),
Bayesian inference,
the Thompson Sampling algorithm,
and a simulation.
Here we have three machines (blue, green and yellow), with the return shown along the x-axis.
Each machine has its own performance distribution, and these lines are drawn from those distributions.
Each line is just the center of its distribution, i.e. the actual expected return of that machine, and we're simply visualizing it here. This is us looking inside the machine itself, as if we had pulled it apart or were its designer.
In reality, when you're playing these machines, you of course don't know this. It's additional information that will help us understand how the algorithm actually works.
The algorithm itself doesn't know this information; it's hidden. It's only shown here so that we can better understand what's going on.
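To make this concrete, here is a minimal sketch of the hidden setup; the means below are made-up illustration values, and playing a machine only ever gives you a noisy sample, never the mean itself:
import random
true_means = {"blue": 2.0, "green": 5.0, "yellow": 3.5}   # hidden actual expected returns (made up)
def pull(machine):
    # playing a machine returns a noisy reward drawn from its hidden distribution
    return random.gauss(true_means[machine], 1.0)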
You have no prior knowledge of the current situation,
so all the machines look identical to you, and you need at least one or even a couple of trial rounds just to get some data to analyze.
Let's say that has happened: there have been some trial rounds for machine number one, the blue machine.
Based on those trial runs, the Thompson Sampling algorithm (and this is where it starts to differ from the Upper Confidence Bound) will construct a distribution.
Let's do the same thing for the green machine,
and for the yellow machine.
Now, what are these distributions actually representing? We are not trying to guess the distributions behind the machines.
The first thing that might come to mind is: all right, we've constructed distributions, so the blue distribution must be our attempt at guessing the actual distribution behind the blue machine,
the green one for the green machine, and the yellow one for the yellow machine.
Well, that's actually not the case.
We are constructing distributions of where we think the actual expected value might lie.
In other words, this is where we think the actual expected returns will be.
We're trying to mathematically describe what we think is going on, or what could be going on. That's important, because it shows that Thompson Sampling is a probabilistic algorithm.
The Upper Confidence Bound (UCB), in contrast, was a deterministic algorithm where every step was strictly determined.
Assume we're in a new round and we have to choose a machine to play. What the algorithm will do is go to each of these distributions and pull a value out of them.
Let's say it pulls a value from the blue distribution,
then a value from the green distribution,
and then a value from the yellow distribution.
These values are sampled at random.
So now it has picked these three values, and guess what that means?
Well, what this actually means is that we have generated our own bandit configuration.
We have created a hypothetical, imaginary set of machines in our own virtual world, where we are saying:
okay, the actual expected return for the blue machine is this value, the actual expected return for the green machine is this value, and the actual expected return for the yellow machine is this value.
Because the green machine has the highest expected return of the three (the largest value on the x-axis), we're obviously going to go with that machine, not only in the virtual world but also in reality.
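Here is a rough sketch of what one selection round could look like in code, assuming binary 0/1 rewards (as in the ads example later) and made-up success/failure counts:
import random
counts = {"blue": (3, 7), "green": (6, 4), "yellow": (5, 5)}   # (rewards of 1, rewards of 0), made up
# one draw per machine: this is our own virtual bandit configuration
samples = {m: random.betavariate(s + 1, f + 1) for m, (s, f) in counts.items()}
# play the machine whose sampled expected return is the highest
chosen = max(samples, key=samples.get)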
Pulling the lever in the real world gives the algorithm new information.
It's going to say: aha, okay, I pulled the lever for the green machine and I got this value, so now I have to adjust my perception of the world.
For the green machine, the distribution I had before is my prior distribution; this is where Bayesian inference comes into play.
After the update, the perception of the world has changed: the distribution has shifted a bit and become narrower, because we have more information.
This is just an example to demonstrate what we're talking about, to get the point across.
But that's the point: every time we add new information, our distribution becomes more and more refined.
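As a tiny sketch of that update (again assuming 0/1 rewards and made-up counts, just for illustration), incorporating one more observation tightens the Beta distribution:
successes, failures = 2, 1          # made-up prior information for the green machine
reward = 1                          # the value we just got from pulling the green lever
if reward == 1:
    successes += 1
else:
    failures += 1
# the updated distribution is Beta(successes + 1, failures + 1); its variance shrinks as counts grow
a, b = successes + 1, failures + 1
variance = (a * b) / ((a + b) ** 2 * (a + b + 1))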
Let's go through the process once again.
Again, we pick some values out of our distributions, and with them we've generated another bandit configuration in our hypothetical world.
Out of these three, we have to pick the best bandit, which this time is the yellow one.
Now the algorithm is going to pull the lever of the yellow bandit in the real world. That means we actually start spending money on the yellow machine.
That triggers the distribution behind the yellow bandit, which gives us some value. This is the actual value we received in the real world.
In other words, we guessed that the yellow machine is currently the best, we played it, and it gave us a result.
Now we incorporate that value into our perception of the world, and our perception adjusts.
The yellow distribution becomes narrower and its center shifts up a bit.
Now we do that again:
a new bandit configuration,
with yellow again coming out as the best one.
We pull the yellow machine's lever, which triggers the distribution
behind the yellow machine, and it spits out a value in the real world.
Then we adjust our perception of the world again to incorporate the new information, so the distribution becomes narrower and shifts up a bit.
These rounds go on and on until we end up with something like this:
So there we go. That is pretty much how the Thompson Sampling algorithm works, and as you can see, it is a probabilistic algorithm. Every time we generate these values, we create a hypothetical setup of the bandits, solve that, and then apply the result in the real world.
We then adjust our perception of reality based on the new information that this generates.
UCB vs Thompson Sampling
UCB is a deterministic algorithm (and there are lots of different modifications of it).
Basically, that means it is very straightforward: in any given round, you just look at the upper confidence bounds, and whichever machine has the highest one, that's the one you pick and whose lever you pull.
Thompson Sampling, on the other hand, is a probabilistic algorithm, because inside the algorithm there are distributions that represent our perception of the world, i.e. where we think the actual expected return of each machine might lie, and in every iteration we generate random values from those distributions.
So if you rerun a single round of the UCB algorithm after receiving the previous value from the machine, you will always get the same result,
whereas if you rerun the current round of the Thompson Sampling algorithm after receiving the previous value, the result can differ every time, because you are always sampling from the distributions that characterize your perception of the world.
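As a rough side-by-side sketch of the two selection rules (the counts are made up, and the UCB formula shown is one common variant):
import math, random
pulls   = [12, 8, 5]     # times each machine has been played (made up)
rewards = [ 6, 5, 3]     # rewards of 1 received from each machine (made up)
n = sum(pulls)           # current round number
# UCB: deterministic - the same counts always give the same pick
ucb = [rewards[i] / pulls[i] + math.sqrt(3 / 2 * math.log(n + 1) / pulls[i]) for i in range(3)]
ucb_choice = ucb.index(max(ucb))
# Thompson Sampling: probabilistic - rerunning these two lines can give a different pick
theta = [random.betavariate(rewards[i] + 1, pulls[i] - rewards[i] + 1) for i in range(3)]
ts_choice = theta.index(max(theta))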
Another important difference is that UCB requires an update at every round. Once you've pulled the lever and got a value back from the machine, you have to incorporate that value right away, because if you don't make the adjustment,
then nothing changes and you're going to be stuck.
Thompson Sampling, on the other hand, can accommodate delayed feedback.
This means that if you pull the lever but only get to know the result 500 rounds down the track, not right away,
the Thompson Sampling algorithm will still work. Why will it still work?
Because even if you run the algorithm without updating your perception of the world, you still get a new set of hypothetical bandits each round.
That gives Thompson Sampling the advantage that you don't have to update the algorithm with the result every single time.
Use Case
For slot machines, this doesn't really matter that much: if a hypothetical person is playing in the casino and pulling these levers, they get to see the results right away, so they could update the algorithm after every pull and simply use UCB.
But for websites and ads, it is a big deal. Instead of A/B testing different ads, or different layouts of your website, you could use Thompson Sampling to get that balance between exploitation and exploration right away.
When you're solving a multi-armed bandit problem on the web for your business, you're getting thousands or tens of thousands of clicks, and updating the algorithm after every single click would be computationally costly, or would require additional resources and a complex process.
So instead you wait until you get, say, 500 or 5,000 clicks, then update the algorithm. Then you let the ads run and run, collect another 5,000 clicks, and update the algorithm again.
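Here is a rough sketch of that batched setup; the function names and the 5,000-click batch size are just illustrative:
import random
d = 10                 # number of ads
wins   = [0] * d       # rewards of 1 applied so far
losses = [0] * d       # rewards of 0 applied so far
pending = []           # (ad, reward) pairs waiting to be applied
def select_ad():
    # sampling still gives varied, sensible picks even while the counts are stale
    theta = [random.betavariate(wins[i] + 1, losses[i] + 1) for i in range(d)]
    return theta.index(max(theta))
def record_click(ad, reward):
    pending.append((ad, reward))
    if len(pending) >= 5000:            # update only once per 5,000 observed clicks
        for a, r in pending:
            if r == 1:
                wins[a] += 1
            else:
                losses[a] += 1
        pending.clear()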
The Thompson Sampling algorithm does tend to work better than UCB.
Let's code this up
Problem statement: we have 10k users, each of whom has been shown 10 ads. Let's find the best ad to run.
So, first install the libraries and import the dataset.
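The code below assumes a pandas DataFrame called dataset with one row per user and one column per ad (a 1 means that user clicked that ad); the file name here is only a placeholder:
import pandas as pd
dataset = pd.read_csv('ads_data.csv')   # placeholder file name: 10,000 rows (users) x 10 columns (ads)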
Since we're also going to do some random sampling, let's import the random module:
import random
For the 10k users (i.e. the number of rounds):
N = 10000
Number of ads
d = 10
Prepare the list that will store the ad selected at each round:
ads_selected = []
Now, let's apply the Thompson Sampling algorithm.
Step 1:
Ni^1(n), the number of times ad i has received a reward of 1 up to round n:
numbers_of_rewards_1 = [0] * d
Ni^0(n), the number of times ad i has received a reward of 0 up to round n:
numbers_of_rewards_0 = [0] * d
The total reward accumulated over all rounds:
total_reward = 0
for n in range(0, N):
    ad = 0              # index of the ad selected in this round
    max_random = 0      # highest random draw so far; this is what we take at step 3
    for i in range(0, d):   # iterating through the ads
Step 2: within the inner for loop, draw a random value from the Beta distribution of ad i. This is a draw theta_i(n) from Beta(Ni^1(n) + 1, Ni^0(n) + 1), i.e. the Bayesian posterior over ad i's click rate when starting from a uniform Beta(1, 1) prior.
        random_beta = random.betavariate(numbers_of_rewards_1[i] + 1, numbers_of_rewards_0[i] + 1)
Step 3: still within the inner for loop, keep the ad with the highest random draw:
        if random_beta > max_random:
            max_random = random_beta
            ad = i      # remember the ad with the highest sampled value so far
Let's update the four variables (we're back at the outer-loop level now):
    ads_selected.append(ad)     # record the ad selected in this round
The reward is the value in the dataset for this user and this ad (0 or 1):
    reward = dataset.values[n, ad]
If we got reward 1, increment numbers_of_rewards_1 for this ad, otherwise increment numbers_of_rewards_0:
    if reward == 1:
        numbers_of_rewards_1[ad] = numbers_of_rewards_1[ad] + 1
    else:
        numbers_of_rewards_0[ad] = numbers_of_rewards_0[ad] + 1
Finally, update the total reward:
    total_reward = total_reward + reward
Let's visualize the results now, with N = 10000 rounds of user data.
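The plot is just a histogram of ads_selected; assuming matplotlib, something like this produces it:
import matplotlib.pyplot as plt
# histogram of how often each ad was selected over the N rounds
plt.hist(ads_selected)
plt.title('Histogram of ads selections (Thompson Sampling)')
plt.xlabel('Ad index')
plt.ylabel('Number of times each ad was selected')
plt.show()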
When N = 1000, we get:
And if N = 500, we get:
haha!
Even with less user data, we still identify ad index 4 as the best, and we find it earlier.
But with UCB on the same amount of data, we could not identify ad index 4 as the best.
So Thompson Sampling really is more powerful than UCB.
Code from here