Machine Learning : Reinforcement Learning - Solving Multi Armed Bandit Problem with UCB (Part 23)
Assume that we have a robotic dog.
We have designed it so that when it performs the task we ask for, we give it a treat (reward 1), and when it does not, it gets no treat (reward 0).
This is basically how reinforcement learning models are trained.
Multi Armed Bandit Problem
Here is a one-armed bandit,
the slot machine found in casinos.
Multi armed bandit:
Assume that each of these machines has been assigned a label.
Each machine also has its own payout distribution: some have a better chance of a big win than others. Our goal is to learn this distribution (which machines to play and which to avoid) so that we can win as much as possible.
Assume this is the performance of these bandits:
So, assume a situation. Karim came to Las Vegas.
He then went to one of the hotels that has these machines to try his luck. He is determined to win big and is trying 5 machines.
He is spending money on them and analyzing their performance. Now assume he has tested the performance of just D2 and D3 and
So he concludes that D4 is better, but is he actually correct?
Technically, no, because he has not explored the other options.
So he has to try every option quickly and, once he finds the best machine, keep playing it to win big.
Now let's look at another example:
Let's look for the best ad by CocaCola.
Note: I never promote CocaCola. Kindly avoid them.
Now we have the ads, but how do we learn their performance distribution?
Let's test at a large scale to collect the data.
The challenge is to find the best ad quickly, while the test is still running.
Upper Confidence Bound (UCB)
So, we will solve the problem using UCB
Let's see what the problem statement is:
The algo:
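The algorithm figure is not reproduced here. As a rough reconstruction based on the code later in this post (the symbol names follow the comments in that code, and the upper-bound symbol in Step 3 is just a label chosen for this sketch), the steps at each round n can be written as:

```latex
\begin{align*}
\text{Step 1:}\quad & N_i(n) = \text{number of times ad } i \text{ was selected up to round } n, \\
                    & R_i(n) = \text{sum of rewards of ad } i \text{ up to round } n \\
\text{Step 2:}\quad & \bar{r}_i(n) = \frac{R_i(n)}{N_i(n)}, \qquad
                      \Delta_i(n) = \sqrt{\frac{3}{2}\,\frac{\log n}{N_i(n)}} \\
\text{Step 3:}\quad & \mathrm{UCB}_i(n) = \bar{r}_i(n) + \Delta_i(n) \\
\text{Step 4:}\quad & \text{select the ad } i \text{ with the largest } \mathrm{UCB}_i(n)
\end{align*}
```

Here n counts rounds starting from 1, which is why the code further down uses log(n+1) with a loop index that starts at 0.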
How does it work?
Assume that this is the performance for the ads
But we don't know it prior to complete testing.
Let's move the x-axis values onto the y-axis here
Let's assume a starting point (same for everyone)
Now assume each ad has a confidence band. Each band contains the ad's actual performance (the colored line).
We now choose the ad with the highest upper confidence bound. At the moment they are all the same.
Assume we chose this one
Now assume that we have run ad D3 and are checking whether a person clicks on it or not.
The person did not click it.
So the red line goes down toward the pink line, taking the band with it.
Since we now have an observation, we are more confident, so the band shrinks to become more accurate.
Now assume we run ad D4 and the user again did not click.
So the red line moves toward the green line.
We now have an observation, so the confidence band shrinks.
Let's pick ad D1 now.
Now assume the user watched the ad, so the red line moved away from the blue line.
And because of the observation, the confidence band shrinks.
Now ad D2.
The person did not click, so the red line moves closer to the purple line and the band shrinks because of the observation.
Now let's choose ad D5.
Assume that, again, no one clicked the ad, so the red line moves closer to the yellow line and the band shrinks.
Now you can see that D5 is the best one so far (its upper bound is higher than the others).
So we run this ad again.
Again the red line moved closer to the yellow line and the band shrank.
Now D1 seems best.
The red line moved closer and the band shrank.
Then D4.
Then D5.
Again D5, and the band shrinks.
Again D5, and the band shrinks.
Then D3, and the band shrinks.
Then D5, and the band shrinks.
Finally, we can see that D5 has the highest upper bound, so we choose this ad as the answer.
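The shrinking bands in this walk-through can be sketched numerically. The snippet below is only an illustration: it assumes the band half-width is sqrt(3/2 * log(n) / Ni), the same expression used in the code later in this post, with n counted from 1. The helper name `confidence_width` is made up for this example.

```python
import math


def confidence_width(round_number, times_selected):
    """Half-width of the confidence band for an ad.

    round_number: current round, counted from 1 (assumption of this sketch).
    times_selected: how many times this ad has been shown so far (must be > 0).
    """
    return math.sqrt(3 / 2 * math.log(round_number) / times_selected)


# At a fixed round, an ad that has been shown more often has a narrower band.
round_number = 100
for times_selected in (1, 5, 20, 80):
    width = confidence_width(round_number, times_selected)
    print(f"shown {times_selected:>2} times -> band half-width {width:.3f}")
```

An ad that is rarely shown keeps a wide band, and the band slowly widens as the rounds go by, which is what eventually gives neglected ads another chance.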
Let's code this down:
Problem statement
Assume that we have 10 ads
Assume every user has been shown all 10 ads, and for each ad we record 1 if the user liked (clicked) it and 0 otherwise
So, we have 10k users who have checked 10 of our ads
Remember, each time we show an ad it costs money. So we need to find the best ad as fast as possible.
Let's import the libraries
Unlike the other ML models, we don't need to split the data into inputs and outputs here, so we just load the dataset.
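The import and loading code is not reproduced here, so the following is only a sketch. It assumes `math`, NumPy and pandas are the libraries in play (the later code calls `math.sqrt`, `math.log` and `dataset.values`), and it builds a synthetic stand-in for the click log instead of reading the real file. The click rates and column names are hypothetical.

```python
import math  # used later for math.sqrt and math.log

import numpy as np
import pandas as pd

# Assumption: a synthetic stand-in for the real dataset (10k users x 10 ads).
# Each cell is 1 if that user would click that ad, else 0.
rng = np.random.default_rng(0)
click_rates = [0.17, 0.13, 0.07, 0.12, 0.27, 0.01, 0.11, 0.21, 0.09, 0.05]  # hypothetical
dataset = pd.DataFrame(
    rng.binomial(1, click_rates, size=(10_000, 10)),
    columns=[f"Ad {i + 1}" for i in range(10)],
)

print(dataset.shape)   # rows are users, columns are ads
print(dataset.head())
```

With the real data, the `pd.DataFrame(...)` call would be replaced by whatever loads the actual click log.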
Implement UCB
Remember the algo
Step 1:
First, set the total number of users (rows)
N=10000
#10k users
Total ads
d=10
#10 ads
ads_selected=[]
#full list of the ads selected, one per round
For Ni(n), we need the number of times each ad was selected.
numbers_of_selections=[0]*d
#creating a list of 10 zeros
Ri(n): the sum of rewards of each ad
sums_of_rewards=[0]*d
#creating a list of 10 zeros
Keeping the total reward
total_reward=0
Now we iterate through all of the users and ads.
for n in range(0,N):
    #looping over all rows (users)
    ad=0
    #index of the first ad is 0: Ad 1 has index 0, Ad 2 has index 1, etc.
    max_upper_bound=0 #max upper bound, used later
    for i in range(0,d): #looping over all columns (ads)
Step 2
We only compute this for ads that have already been selected, since Ni(n) can't be 0 (division by zero)
        if(numbers_of_selections[i]>0): #checking that the ad was selected at least once
            average_reward=sums_of_rewards[i]/numbers_of_selections[i]
            #ri(n) = Ri(n) / Ni(n)
Now the confidence interval:
            delta_i=math.sqrt(3/2*math.log(n+1)/numbers_of_selections[i])
            #log(n+1) because n starts at 0 and log(0) is undefined
Step 3
            upper_bound=average_reward+delta_i #upper bound
The remaining lines:
        else: #ads that have not been selected yet
            upper_bound=1e400 #a huge value (evaluates to infinity), always larger than max_upper_bound
        #Step 4
        if(upper_bound>max_upper_bound):
            max_upper_bound=upper_bound
            ad=i #selecting the ad
    ads_selected.append(ad) #adding the selected ad
    numbers_of_selections[ad]=numbers_of_selections[ad]+1 #adding 1 to the ad's selection count
    reward=dataset.values[n,ad] #the reward from the dataset (2D matrix)
    sums_of_rewards[ad]=sums_of_rewards[ad]+reward #adding the reward to the ad's sum
    total_reward=total_reward+reward #adding to the total reward
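The fragments above can be assembled into a single function. This is a minimal sketch rather than the exact code from the repository: the function name `run_ucb` is made up, the rewards are passed as a plain list of rows, and the demo data is synthetic with deliberately exaggerated, hypothetical click rates so that index 4 stands out.

```python
import math
import random


def run_ucb(rewards, n_rounds, n_ads):
    """Run UCB over the first n_rounds rows of a 0/1 reward matrix."""
    ads_selected = []
    numbers_of_selections = [0] * n_ads
    sums_of_rewards = [0] * n_ads
    total_reward = 0
    for n in range(n_rounds):
        ad = 0
        max_upper_bound = 0
        for i in range(n_ads):
            if numbers_of_selections[i] > 0:
                average_reward = sums_of_rewards[i] / numbers_of_selections[i]
                # log(n + 1) because n starts at 0
                delta_i = math.sqrt(3 / 2 * math.log(n + 1) / numbers_of_selections[i])
                upper_bound = average_reward + delta_i
            else:
                # An ad that was never shown is tried first.
                upper_bound = float("inf")
            if upper_bound > max_upper_bound:
                max_upper_bound = upper_bound
                ad = i
        ads_selected.append(ad)
        numbers_of_selections[ad] += 1
        reward = rewards[n][ad]
        sums_of_rewards[ad] += reward
        total_reward += reward
    return ads_selected, numbers_of_selections, total_reward


# Synthetic demo data (assumption): 2000 users, 10 ads, hypothetical click rates.
rng = random.Random(0)
click_rates = [0.05, 0.10, 0.15, 0.20, 0.70, 0.05, 0.10, 0.15, 0.20, 0.10]
rewards = [[1 if rng.random() < p else 0 for p in click_rates] for _ in range(2000)]

ads_selected, counts, total_reward = run_ucb(rewards, n_rounds=2000, n_ads=10)
print("selections per ad:", counts)
print("total reward:", total_reward)
```

Passing `dataset.values` instead of the list should also work, since a 2D array supports the same `rewards[n][ad]` indexing.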
let's plot it
Now we reduce N to see how small it can get while index 4 still comes out as the best ad.
Let's change N to 5000
Still we got the same plot
If N = 1000,
we still get index 4 on top.
What about N = 500?
Now it is no longer possible to identify index 4 as the top ad.
So we need data from more than 500 users to identify the best ad for our business.
Remember, we need to find the best ad and find it quickly, with as few users as possible, because every ad shown costs money. Isn't it better to test 10 ads on 550 users than on 10,000 users?
Surely 550 users will cost us less.
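The experiment of shrinking N can be scripted instead of rerun by hand. The sketch below is an assumption-heavy illustration: `ucb_counts` is a made-up helper that applies the same upper-bound rule as the code above, and the data is synthetic with hypothetical click rates, so the N at which the best ad stops standing out will differ from the real dataset.

```python
import math
import random


def ucb_counts(rewards, n_rounds):
    """Return how many times each ad was selected in the first n_rounds rows."""
    n_ads = len(rewards[0])
    counts = [0] * n_ads
    sums = [0] * n_ads
    for n in range(n_rounds):

        def upper_bound(i):
            if counts[i] == 0:
                return float("inf")  # never shown yet: try it first
            return sums[i] / counts[i] + math.sqrt(3 / 2 * math.log(n + 1) / counts[i])

        # max() returns the first ad with the highest bound, like the strict ">" test.
        ad = max(range(n_ads), key=upper_bound)
        counts[ad] += 1
        sums[ad] += rewards[n][ad]
    return counts


# Synthetic stand-in data (assumption): 10k users, 10 ads, hypothetical click rates.
rng = random.Random(1)
click_rates = [0.17, 0.13, 0.07, 0.12, 0.27, 0.01, 0.11, 0.21, 0.09, 0.05]
rewards = [[int(rng.random() < p) for p in click_rates] for _ in range(10_000)]

for n_rounds in (500, 1000, 5000, 10_000):
    counts = ucb_counts(rewards, n_rounds)
    best = counts.index(max(counts))
    print(f"N={n_rounds:>6}: most selected ad index = {best}, counts = {counts}")
```

Comparing the printed counts across N shows how clearly, or not, one ad has pulled ahead at each budget.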
Complete the code: Repository