UCB reinforcement learning

Assume that we have a robotic dog.

We have designed it so that when it does a task we ask for, we give it a treat (reward 1), and when it does not, we give it no treat (reward 0).

This is basically how we train reinforcement learning models.

Multi Armed Bandit Problem

Here is a one-armed bandit,

basically a slot machine of the kind used in casinos.

Multi armed bandit:

Assume that each of these machines has been assigned a code.

Each machine also has its own performance, that is, its own chance of paying out a big loot. Our target is to learn this performance distribution (which machines to play and which to avoid) so that we can win some awesome loot.

Assume this is the performance of these bandits:

So, assume a situation. Karim came to Las Vegas.

He then went to one of the hotels that has these machines to try his luck. He is determined to win a big loot and is trying 5 machines.

He is spending money on them and analyzing their performance. Now assume he has tested the performance of just D2 and D3 and

So, he finds that D4 is better, but is he actually correct?

Technically, no, because he has not explored the other options.

So, he has to quickly try every option, and once he finds the best one, he should keep using that machine to win a big loot.

Now, let's look at another example:

Let's look for the best ad by CocaCola.

Note: I never promote CocaCola. Kindly avoid them.

Now we have the ads, but how do we know their performance or distribution?

Let's test them at a large scale to collect the data.

The challenge is to find the best ad quickly, while the test is still running.

Upper Confidence Bound (UCB)

So, we will solve the problem using UCB.

Let's see what the problem statement is:

The algo:
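As a sketch, the steps can be written out with the same symbols and the same 3/2 constant that the code later in this post uses. Here the round number n is assumed to count from 1 (the code writes log(n+1) because its loop index starts at 0), and the step numbering follows the labels used in that code.

```latex
\textbf{Step 1.} At each round $n$, keep two numbers for every ad $i$:
\[
N_i(n) = \text{number of times ad } i \text{ was selected up to round } n, \qquad
R_i(n) = \text{sum of rewards of ad } i \text{ up to round } n .
\]

\textbf{Step 2.} Compute the average reward and the confidence interval half-width:
\[
\bar{r}_i(n) = \frac{R_i(n)}{N_i(n)}, \qquad
\Delta_i(n) = \sqrt{\frac{3}{2}\,\frac{\ln n}{N_i(n)}} .
\]

\textbf{Step 3.} Compute the upper confidence bound of every ad:
\[
\mathrm{UCB}_i(n) = \bar{r}_i(n) + \Delta_i(n) .
\]

\textbf{Step 4.} Select the ad with the largest upper confidence bound:
\[
\mathrm{ad}(n) = \arg\max_i \; \mathrm{UCB}_i(n) .
\]
```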

How does it work?

Assume that this is the performance of the ads.

But we don't know it before the testing is complete.

Let's move the values from the x axis to the y axis here.

Let's assume a starting point (the same for every ad).

Now, assume each ad has a confidence band. Each band contains the ad's actual performance (the colored line).

We will now choose the ad with the highest upper confidence bound. Currently they are all the same.

Assume we chose this one

Now, assume that we have run ad D3 and are checking whether a person clicks on it or not.

We find that the person did not click it.

So, the red line goes down toward the pink line, along with the band.

As we now have an observation, we are more confident, so we shrink the band to be more accurate.

Now, assume we run ad D4 and the user again did not like it.

So, the red line moves toward the green line.

We now have an observation, so the confidence band shrinks.

Let's pick ad D1 now.

Now assume that our observer watched the ad, so the red line moves away from the blue line.

And because of the observation, the confidence band shrinks.

Now ad D2.

We see that the person did not click. So, the red line moves closer to the purple line, and the band shrinks because of the observation.

Now, let's choose ad D5.

Assume that, again, no one clicked the ad: the red line moves closer to the yellow line and the band shrinks.

Now you can see that D5 is the best one so far (its upper bound is higher than the others).

So, we will run this ad again.

Again, the red line moves closer to the yellow line and the band shrinks.

Now D1 seems best.

The red line moves closer and the band shrinks.

Then D4.

Then D5.

Again D5, and the band shrinks.

Again D5, and the band shrinks.

Then D3, and the band shrinks.

Then D5, and the band shrinks.

Finally, we can see that D5 has the highest upper bound, so we choose this ad as the answer.
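The shrinking band in the walkthrough above can be checked numerically. This is a minimal sketch, assuming the same 3/2 * log(n) / Ni(n) form that the code later in this post uses; the function name `confidence_radius` and the sample values are only illustrative.

```python
import math


def confidence_radius(round_number: int, times_selected: int) -> float:
    """Half-width of the confidence band for an ad shown `times_selected` times."""
    return math.sqrt(1.5 * math.log(round_number) / times_selected)


# Hold the round fixed and vary how often the ad has been shown.
selection_counts = (1, 10, 100)
radii = [confidence_radius(1000, k) for k in selection_counts]

for k, radius in zip(selection_counts, radii):
    print(f"selected {k:>3} times -> band half-width {radius:.3f}")
```

The more often an ad has been shown, the narrower its band becomes, which is why an ad that keeps getting picked eventually stops looking better than it really is.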

Let's code this:

Problem statement

Assume that we have 10 ads.

Assume every user has been shown the 10 ads, and for each ad we record 1 if the user liked it and 0 otherwise.

So, we have 10k users who have each seen our 10 ads.

Remember, each time we show an ad, it costs us money. So, we need to find the best ad as fast as possible.

Let's import the libraries.

Unlike with other ML models, we don't need to provide separate inputs here, so we will just load the dataset.
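The dataset file itself is not shown in this post, so here is a hedged stand-in: a small simulator that builds a table of the same shape (10,000 users by 10 ads, each cell 0 or 1). The click rates are hypothetical numbers chosen only for illustration, with index 4 deliberately made the best, and `simulate_clicks` is a name invented for this sketch.

```python
import random


def simulate_clicks(num_users, click_rates, seed=0):
    """Return a num_users x len(click_rates) table of 0/1 rewards."""
    rng = random.Random(seed)  # fixed seed so the table is reproducible
    return [
        [1 if rng.random() < rate else 0 for rate in click_rates]
        for _ in range(num_users)
    ]


# Hypothetical click-through rates for 10 ads (not taken from any real data).
CLICK_RATES = [0.15, 0.10, 0.07, 0.12, 0.30, 0.02, 0.10, 0.20, 0.08, 0.05]

rows = simulate_clicks(10_000, CLICK_RATES)
print(len(rows), "users x", len(rows[0]), "ads")
print("first user:", rows[0])
```

The code further down reads rewards with dataset.values[n, ad], so it expects a pandas DataFrame named dataset; wrapping a table like this in a DataFrame should give the same access pattern. That code also calls math.sqrt and math.log, so it needs import math.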

Implement UCB

Remember the algo

Step 1:

First, set the total number of users (rounds):

N = 10000  # 10k users

Total number of ads:

d = 10  # 10 ads

ads_selected = []  # the ad selected in each round, in order

For Ni(n), we need the number of times each ad was selected.

numbers_of_selections = [0] * d  # a list of 10 zeros

Ri(n): the sum of rewards of each ad

sums_of_rewards = [0] * d  # a list of 10 zeros

Keeping the total reward:

total_reward = 0

Now, we will iterate through all of the users and ads.

for n in range(0, N):  # loop over all rows (one user per round)
    ad = 0  # index of the selected ad; Ad 1 has index 0, Ad 2 has index 1, etc.
    max_upper_bound = 0  # highest upper bound seen so far in this round
    for i in range(0, d):  # loop over all columns (one ad per column)

Step 2

We can only compute the average for ads that have already been selected, because Ni(n) can't be 0 (it would be a division by zero).

        if numbers_of_selections[i] > 0:  # ad i has been selected at least once
            average_reward = sums_of_rewards[i] / numbers_of_selections[i]  # ri(n) = Ri(n) / Ni(n)

Now, the confidence interval:

            delta_i = math.sqrt(3 / 2 * math.log(n + 1) / numbers_of_selections[i])

We use math.log(n + 1) because n starts at 0 and the log of 0 is undefined.

Step 3

            upper_bound = average_reward + delta_i  # upper bound

The remaining lines:

        else:  # ads that have not been selected yet
            upper_bound = 1e400  # evaluates to infinity in Python, so it always beats max_upper_bound
        # Step 4
        if upper_bound > max_upper_bound:
            max_upper_bound = upper_bound
            ad = i  # select this ad
    ads_selected.append(ad)  # record the ad selected in this round
    numbers_of_selections[ad] = numbers_of_selections[ad] + 1  # add 1 to the ad's selection count
    reward = dataset.values[n, ad]  # reward from the dataset (row n, column ad)
    sums_of_rewards[ad] = sums_of_rewards[ad] + reward  # add the reward to the ad's sum
    total_reward = total_reward + reward  # add the reward to the total
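Put together, the pieces above can be sketched as one self-contained function. This is not the author's full script: the real dataset is replaced by simulated 0/1 rewards with made-up click rates (index 4 is made clearly the best), the function name `run_ucb` is invented here, and math.inf stands in for the 1e400 placeholder.

```python
import math
import random


def run_ucb(rewards, num_rounds):
    """Run UCB over the first num_rounds rows of a 0/1 reward table."""
    num_ads = len(rewards[0])
    numbers_of_selections = [0] * num_ads  # Ni(n)
    sums_of_rewards = [0] * num_ads        # Ri(n)
    ads_selected = []
    total_reward = 0
    for n in range(num_rounds):
        ad = 0
        max_upper_bound = 0.0
        for i in range(num_ads):
            if numbers_of_selections[i] > 0:
                average_reward = sums_of_rewards[i] / numbers_of_selections[i]
                delta_i = math.sqrt(1.5 * math.log(n + 1) / numbers_of_selections[i])
                upper_bound = average_reward + delta_i
            else:
                upper_bound = math.inf  # forces every ad to be tried once
            if upper_bound > max_upper_bound:
                max_upper_bound = upper_bound
                ad = i
        ads_selected.append(ad)
        numbers_of_selections[ad] += 1
        reward = rewards[n][ad]
        sums_of_rewards[ad] += reward
        total_reward += reward
    return ads_selected, numbers_of_selections, total_reward


# Simulated stand-in data: hypothetical click rates, not the post's dataset.
rng = random.Random(42)
click_rates = [0.1, 0.1, 0.1, 0.1, 0.5, 0.1, 0.1, 0.1, 0.1, 0.1]
rewards = [[1 if rng.random() < p else 0 for p in click_rates] for _ in range(10_000)]

ads_selected, counts, total_reward = run_ucb(rewards, 10_000)
best_ad = counts.index(max(counts))
print("most selected ad index:", best_ad)
print("total reward:", total_reward)
```

Because the number of rounds is a parameter, the same function can be called with a smaller value to repeat the kind of experiment with smaller N described later in this post.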

Let's plot it.
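One way to look at the same information without a plotting library is a text tally of how often each ad was selected. In this sketch the short `ads_selected` list is a hypothetical stand-in for the list built by the loop above.

```python
from collections import Counter

# Hypothetical stand-in for the ads_selected list built by the UCB loop.
ads_selected = [0, 1, 2, 3, 4, 4, 0, 4, 4, 3, 4, 4]

selection_counts = Counter(ads_selected)

# One bar of '#' characters per ad, in index order.
for ad_index, count in sorted(selection_counts.items()):
    print(f"ad {ad_index}: {'#' * count} ({count})")

most_selected_ad, times_shown = selection_counts.most_common(1)[0]
print("most selected ad index:", most_selected_ad)
```

The ad with the longest bar is the one UCB kept returning to, which is what the tallest bar of a histogram of ads_selected would show.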

Now we need to check, by reducing N, how small N can be while we still get index 4 as the best ad.

Let's change N to 5000.

We still get the same plot.

If N = 1000,

we still get index 4 on top.

What about N = 500?

Now it is no longer possible to identify index 4 as the top ad.

So, we need more than 500 users' data to identify the best ad for our business.

Remember, we need the best ad, and we also need to find it quickly, using as few users as possible. We are spending money, so isn't it better to show the 10 ads to 550 users than to 10,000 users?

Surely 550 users will cost us less.

Complete the code: Repository

Machine Learning : Reinforcement Learning - Solving Multi Armed Bandit Problem with UCB (Part 23)
