Artificial Intelligence: Reinforcement Learning - Deep Q-Learning (Part 36)

Deep Q-Learning is the result of combining Q-Learning with an Artificial Neural Network.

The states of the environment are encoded as a vector, which is passed as input to the Neural Network. The Neural Network then tries to predict which action should be played by returning, as its outputs, a Q-value for each of the possible actions.

Eventually, the best action to play is chosen by either taking the one that has the highest Q-value, or by selecting one at random with a strategy called epsilon-greedy, which is used for exploration.

Check out the Q Learning Blog before we start

So, now we will represent our puzzle in a 2D format, so that we can know our agent's location, the goal's location, and more.

Then we will feed this to a Neural Network

You can see we have axes X1 and X2, and we are feeding them as inputs to the neural network.

Just a reminder,

We were at cell (0,1), and there we had the value Q(s,a).

Then we moved to cell (0,2), and the value changed to R(s,a) + gamma * max Q(s', a').

Although both mean the same thing,

the difference is the time. The time has changed: for example, we were at cell (0,1) at 5 pm and at cell (0,2) at 10 pm.

So this gap between the two estimates creates a temporal difference (TD).
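To make this concrete, here is a tiny sketch of the temporal difference computed from those two estimates (the numbers for gamma, the reward, and the Q-values are made up for illustration, not taken from our puzzle):

gamma = 0.9 # discount factor (assumed value)
q_old = 0.5 # Q(s, a) we had at cell (0,1)
reward = -0.04 # R(s, a) received for the move
max_q_next = 0.7 # max over a' of Q(s', a') at cell (0,2)

td_error = reward + gamma * max_q_next - q_old # TD = R(s,a) + gamma*max Q(s',a') - Q(s,a)
print(td_error)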

So, using the Neural Network, we will get Q1, Q2, Q3, Q4

and we will compare them with the target Q1, target Q2, target Q3, target Q4

The neural network will learn by adapting the weights!

So, here the loss is the squared difference between each predicted Q-value and its target, summed over all of them:

Loss = (target Q1 - Q1)^2 + (target Q2 - Q2)^2 + (target Q3 - Q3)^2 + (target Q4 - Q4)^2

We then take this loss and use backpropagation with stochastic gradient descent to pass it back through the network and update the weights of the synapses, so that the next time we go through the network, the weights describe the environment a bit better.
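As a rough sketch of that loss with made-up numbers (in the real setup, q_predicted would come from the network's forward pass, so the loss could be backpropagated through the weights):

import torch

q_predicted = torch.tensor([0.2, 0.8, 0.1, 0.4]) # Q1..Q4 from the network (illustrative values)
q_target = torch.tensor([0.3, 0.6, 0.1, 0.5]) # target Q1..Q4 (illustrative values)

loss = torch.sum((q_target - q_predicted) ** 2) # sum of squared differences
# in the real network, loss.backward() would then backpropagate this loss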

Then, using the softmax function, we pick the best action.

Experience Replay

So, assume this is a road for our self-driving car, and the white line is our car.

So, as the car moves, most things remain the same. What does that mean? For example, if you are driving across a desert, then no matter how far you go, you will still see sand and no greenery.

Similarly, while driving along a road, your GPS location keeps changing, but most of the time you see building after building. They are not exactly the same buildings, but they are the same type of thing; a monster or a sea beach does not suddenly appear.

It's basically road and buildings, over and over, for as long as we are driving.

So, here in the example, it's mostly a straight line to drive.

At this point, it gets some change, and the Neural Network modifies/updates itself on that,

and then it drives again.

So, whatever happened, the model adapted to the change and then the car kept driving.

It faced changes at various points, adapted, and kept driving. The important thing is this: whenever it faced a change, it adapted the weights and started driving again, then it faced the same type of road for a while, then it faced another change and adapted again. This process repeats until we get to the final goal.

So, once the memory reaches a certain threshold, the agent decides for itself: "Okay, it's time to learn. I have this batch of experiences, and now I'm going to learn from it." It then randomly selects a uniformly distributed sample, so all experiences are considered equal.

It takes a uniformly distributed sample from that batch of experiences, goes through them, and learns from them. That way it breaks the bias that comes from the sequential nature of the experiences, which you would get if you fed them through the network one after the other.

An experience can also come up in several batches, because the batch is maintained as a rolling window of experiences.

Older experiences get kicked out as newer experiences are added, but an experience stays in the window for quite some time, so the car/agent can learn from it several times.

So, it's Experience Replay that gives you an opportunity to learn from more experiences than if you were just learning from one at a time,

because you have that batch, and it's a rolling window, and therefore, even if your environment is limited in experiences, the Experience Replay approach can help you learn faster.
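A minimal sketch of that rolling window of experiences, using a deque (the full ReplayMemory class we actually train with is implemented later in this post):

import random
from collections import deque

buffer = deque(maxlen=100000) # rolling window: oldest experiences get kicked out automatically

def remember(state, action, reward, next_state, done):
    buffer.append((state, action, reward, next_state, done))

def sample_batch(batch_size=100):
    return random.sample(buffer, k=batch_size) # uniform sample: all experiences treated as equal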

Action selection policies

Using the softmax function, we can surely get the best action; for example, here it is Q2.
But what can happen is that once the Neural Network settles on a result (Q2), it might face a penalty multiple times.

For example, assume we got Q2, which means go left.

And since it is assumed to be the best action, we take it 80% of the time.

But when we take that action, we start to get a penalty, for example -0.04.

Since we take it 80% of the time, our value is going to decrease by 0.04 nearly every single step.

Over time this is a big loss for us, and the total value decreases. Now we need to look at other options that initially had a lower chance (10%), which will lead us to change direction.

So, you can realize that always choosing the action with the maximum probability (for example, 80%) might not be a good fit.

So, Q2 might not be a good decision in the long run. It's important to explore other options like Q1, Q3, Q4.

To solve this issue, we can use these action selection policies:

-> Epsilon-greedy: When the epsilon-greedy policy is used, it gives us the best action most of the time (the 80% one), but it will also explore the other actions that have lower chances (10%, 10%, etc.).

So for instance, if you set epsilon to 0.1 (10%), then 90% of the time you are still selecting the best action based on the highest Q-value (the one we get from the Neural Network), but 10% of the time you are selecting a random action.

If you set epsilon to 0.05, that means that 95% of the time the agent is going to take the action with the highest Q-value, but 5% of the time it is still going to select a random action.
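A minimal epsilon-greedy sketch (q_values here stands for the list of Q-values predicted by the network; epsilon=0.1 is just an example value):

import random
import numpy as np

def epsilon_greedy(q_values, epsilon=0.1):
    if random.random() > epsilon: # exploit: best action, (1 - epsilon) of the time
        return int(np.argmax(q_values))
    return random.randrange(len(q_values)) # explore: random action, epsilon of the time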

-> Epsilon soft:

Epsilon-soft is the opposite. Basically, you select at random (1 - epsilon) of the time. So if your epsilon is 0.1, i.e. 10%, then only 10% of the time do you take the action we get from the Neural Network, and 90% of the time you select a random action.

-> Softmax:

After applying softmax, we get a probability for each Q-value, each in the range between zero and one, and they all add up to one.

Since Q2 has the highest probability, we will choose it most of the time.

We're gonna use these as our distribution and we're gonna say, "Okay we're gonna be taking Q2 90% of the time but 5% of the time we're still gonna be taking Q1 and 2% of the time we're gonna take Q3 and 3% of the time we're gonna be taking Q4."

And the beauty here is that as these values update while the agent goes through the environment more and more, the agent becomes more familiar with the environment, and the probabilities update along with it.

Even though Q2 has the highest probability here, nothing says we won't sometimes (5% of the time, to be precise) select Q1 as the action to take,

sometimes action 3, Q3,

and sometimes action 4, Q4.
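A minimal sketch of softmax action selection (the temperature parameter tau is an extra knob not discussed above; tau=1.0 is just a default):

import numpy as np

def softmax_action(q_values, tau=1.0):
    prefs = np.asarray(q_values, dtype=np.float64) / tau
    prefs -= prefs.max() # for numerical stability
    probs = np.exp(prefs) / np.exp(prefs).sum() # probabilities in [0, 1] that sum to 1
    return np.random.choice(len(q_values), p=probs) # mostly picks Q2, but sometimes the others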

Let's code this down

We are going to land our agent on the Moon using the LunarLander environment from Gymnasium.

Gymnasium is a third-party library of environments that you can install and then apply your AI to its games.

Install Gymnasium

!pip install gymnasium

!pip install "gymnasium[atari, accept-rom-license]"

!apt-get install -y swig

!pip install "gymnasium[box2d]"

Importing the libraries

import os

import random

import numpy as np #to work with mathematics

import torch #to import pytorch

import torch.nn as nn #for neural networks

import torch.optim as optim #to import optimizer

import torch.nn.functional as F #to use functions

import torch.autograd as autograd #for automatic differentiation (gradients)

from torch.autograd import Variable #for torch variables

from collections import deque, namedtuple #used during the training

Building the AI

If we check the action space, we will see we have 4 actions

Also, the input is going to be an 8-dimensional vector.

class Network(nn.Module):#creating a class Network

Our state_size will be 8 (the observation space), action_size will be 4, and seed=42 just to make the randomness reproducible.

def __init__(self, state_size, action_size, seed=42):

super(Network, self).__init__() #just to activate inheritance

self.seed = torch.manual_seed(seed) # fix the random seed so the results are reproducible

Now, we will start building the main part

the first variable I'm creating here is going to be fc1, representing the first full connection between the input layer and the first fully connected layer.

self.fc1 = nn.Linear(input size, optimal number of neurons)

We are just guessing 64 as the optimal number of neurons; we will change it if needed.

self.fc1 = nn.Linear(state_size, 64)

Then the second fully connected layer

self.fc2 = nn.Linear(number of neurons in 1st one, optimal number of neurons for second layer)

self.fc2 = nn.Linear(64, 64)

The third fully connected layer connects to the output. Since we want the network to tell us a Q-value for each of the possible actions (action_size = 4), the output layer has action_size neurons.

self.fc3 = nn.Linear(second fully connected layer neurons, output neuron)

self.fc3 = nn.Linear(64, action_size)

Now, to propagate the signal from the state to the output layer, let's build a function.

def forward(self, state):

self.fc1(state) actually returns the output of the first fully connected layer,

so, x = self.fc1(state)

Then from this functional module we're gonna call one of its functions, which is the relu function, representing, of course, the rectifier activation function.

x = F.relu(self.fc1(state))

This will actually propagate the signal from the input layer to the first fully connected layer with a rectifier activation function.

let's connect first fully connected layer to the second fully connected layer.

So it takes the output of the first layer as input and passes it on:

x = F.relu(self.fc2(x))

And then we call our third full connection, fc3, which takes as input x, which by now has been passed through the first fully connected layer and then the second, each with the rectifier activation function. And there we go.

return self.fc3(x)

That forward-propagates the signal from the input layer containing the state to the output layer containing our actions, and we're basically done creating the architecture of our Neural Network.
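Putting the pieces above together (using the imports from earlier), the whole class looks like this:

class Network(nn.Module):

    def __init__(self, state_size, action_size, seed=42):
        super(Network, self).__init__()
        self.seed = torch.manual_seed(seed)
        self.fc1 = nn.Linear(state_size, 64)
        self.fc2 = nn.Linear(64, 64)
        self.fc3 = nn.Linear(64, action_size)

    def forward(self, state):
        x = F.relu(self.fc1(state))
        x = F.relu(self.fc2(x))
        return self.fc3(x)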

Train the AI

Importing gymnasium

import gymnasium as gym #importing gymnasium

We are going to import this one

env = gym.make('LunarLander-v2')

then adding state_shape, state_size and number_actions

state_shape= env.observation_space.shape

state_size= env.observation_space.shape[0] #the number of elements in this input state.

number_actions = env.action_space.n #number of actions
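As a quick sanity check, printing these should match the 8-dimensional state and 4 actions mentioned above:

print(state_shape) # (8,)
print(state_size) # 8
print(number_actions) # 4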

Initializing the hyperparameters

learning rate

learning_rate= 5e-4

Then minibatch_size refers of course to the number of observations used in one step of the training to update the model parameters

minibatch_size = 100

discount factor/gamma

discount_factor = 0.99

memory of the AI

replay_buffer_size= int(1e5)

interpolation parameter used for the soft update of the target network during training

interpolation_parameter = 0.001

Implementing Experience Replay

initialize the class and its __init__ method

class ReplayMemory(object):

def __init__(self, capacity): #capacity = the capacity of the memory

Here we check whether we have a GPU (cuda) or only a CPU to compute on; using the GPU makes the process faster.

self.device=torch.device("cuda" if torch.cuda.is_available() else "cpu")

self.capacity = capacity #capacity variable

self.memory = [] #the list that will store the experiences, each one containing the state, the action, the reward, the next state, and whether we are done or not.

self.position = 0

Now we will create a method that will add those experiences into this replay memory buffer while also checking that we don't exceed the capacity

def push(self, event):

Event is what basically contains the state, the action, the next state, the reward, and that Boolean done saying whether we are done or not

self.memory.append(event) #append an event

if len(self.memory) > self.capacity: #make sure it does not exceed the capacity

del self.memory[0] #delete the oldest event

Then we will randomly select a batch of experiences from the memory buffer using the sample method.

def sample(self, batch_size):

experiences= random.sample(self.memory, k=batch_size) #sample from self.memory; k is the number of experiences we want in the batch, which is batch_size

And then we extract and stack those elements one by one, starting with the states.

#states

states=torch.from_numpy(np.vstack([e[0] for e in experiences if e is not None])).float().to(self.device)

#e[0] is the first element (the state) of each experience; we make sure e is not None; torch.from_numpy converts the stacked array to a PyTorch tensor, .float() makes the values floats, and .to(self.device) moves the tensor to the chosen CPU or GPU

#actions

actions=torch.from_numpy(np.vstack([e[1] for e in experiences if e is not None])).long().to(self.device)

Since actions are the discrete values 0, 1, 2, 3, we can't make them floats; they need to be long integers.

#rewards

rewards=torch.from_numpy(np.vstack([e[2] for e in experiences if e is not None])).float().to(self.device) # same as states

#next_states

next_states=torch.from_numpy(np.vstack([e[3] for e in experiences if e is not None])).float().to(self.device) #same as states

#Done

dones=torch.from_numpy(np.vstack([e[4] for e in experiences if e is not None]).astype(np.uint8)).float().to(self.device) #same as states; .astype(np.uint8) converts the boolean values to integers, which are then converted to floats

return states,next_states,actions,rewards,dones
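As a small usage sketch of this class (dummy_state is a made-up 8-dimensional array, not a real LunarLander observation):

memory = ReplayMemory(replay_buffer_size)
dummy_state = np.zeros(8, dtype=np.float32)
memory.push((dummy_state, 0, 1.0, dummy_state, False)) # (state, action, reward, next_state, done)
# once len(memory.memory) > minibatch_size, we could call:
# states, next_states, actions, rewards, dones = memory.sample(minibatch_size)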

Implementing the DQN class

class Agent(): #creating our agent

def __init__(self, state_size, action_size):

self.device=torch.device("cuda" if torch.cuda.is_available() else "cpu")# this is to use GPU or CPU; make computation faster

self.state_size=state_size

self.action_size=action_size

#Two Q networks

self.local_qnetwork=Network(state_size,action_size).to(self.device) #creating the local network, to(self.device) to choose CPU or GPU

self.target_qnetwork=Network(state_size,action_size).to(self.device) #creating the target network

#optimizer

self.optimizer=optim.Adam( self.local_qnetwork.parameters(),lr=learning_rate)

Here, parameters() returns the weights of the network, which is exactly what the optimizer will update step by step so that the network predicts better and better actions to play in order to land properly on the moon.

self.memory=ReplayMemory(replay_buffer_size) #creating the memory ; replay_buffer_size is the capacity

#timestep

self.t_step=0 #step counter

Now the step method: this is the method that will store experiences and decide when to learn from them.

def step(self,state,action,reward,next_state,done):

store experience in replaymemory

self.memory.push((state,action,reward,next_state,done))

Then we increment the time step counter, which is one of our object variables, self.t_step. We increment it and reset it every four steps, so that we learn every four steps.

self.t_step=(self.t_step+1)%4 #We're gonna increment this time step counter and reset it every four steps

Now, check if we have reached a new four steps

if self.t_step==0:

And so if that's the case, then what are we gonna do? Well, we're gonna learn, because we want to learn every four steps. But then remember that when we learn, we don't learn on one observation only.

We actually learn on a minibatch of observations. That's why we created the minibatch variable before, which we initialized to 100.

if len(self.memory.memory)> minibatch_size:#memory size of our memory len(self.memory.memory), self.memory is the instance of ReplayMemory and later memory is the attribute of those.

experiences= self.memory.sample(minibatch_size) #this will sample 100 experiences from the memory

#learn from experience

self.learn(experiences,discount_factor)

Here is the act method, which will select an action based on a given state of the environment.

def act(self,state,epsilon=0.): #0. to mean float

state= torch.from_numpy(state).float().unsqueeze(0).to(self.device) #convert the state from a numpy array to a float torch tensor and move it to the chosen device

unsqueeze(0) adds an extra dimension at the beginning, corresponding to the batch; this extra dimension says which batch this state belongs to.

#Local network to evaluate

self.local_qnetwork.eval()

#make sure we are in inference (prediction) mode

with torch.no_grad(): #any gradient computation is disabled

action_values= self.local_qnetwork(state) #action_values: which will be of course the actions predicted

#training mode

self.local_qnetwork.train() #set to training mode

#epsilon greedy action selection policy

if random.random() > epsilon: #we're gonna select the action with the highest Q value.

return np.argmax(action_values.cpu().data.numpy()) #argmax selects the action with the highest Q-value; since this selection is simple, we move the values to the CPU and convert them to numpy

else: #we're gonna select a random action.

return random.choice(np.arange(self.action_size))

The learn method will update the agent's Q-values based on the sampled experiences.

def learn(self,experiences,discount_factor):

Unpack our sampled experiences into their respective categories: states, next states, actions, rewards, and dones.

states, next_states, actions, rewards, dones = experiences

Get the maximum predicted Q values (for next states) from target model

next_q_targets = self.target_qnetwork(next_states).detach().max(1)[0].unsqueeze(1)

# detach() the tensor of action values from the computation graph; max(1) takes the maximum over dimension 1, [0] selects the maximum values (rather than their indices), and .unsqueeze(1) adds back a dimension for the batch

#Compute Q targets for current states

q_targets = rewards + (discount_factor * next_q_targets * (1 - dones))
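This is exactly the target from the temporal-difference discussion earlier, R(s,a) + gamma * max Q(s', a'): rewards plays the role of R(s,a), next_q_targets is max Q(s', a'), and multiplying by (1 - dones) zeroes out the future term when the episode has ended.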

#now find expected Q values from local Q network

q_expected= self.local_qnetwork(states).gather(1,actions) #gather the Q-values of the actions that were actually taken

#compute loss

loss = F.mse_loss(q_expected,q_targets)#mse= mean squared error loss,

#minimize the loss & backpropagate

self.optimizer.zero_grad() #reset the gradients of the optimizer to zero with zero_grad()

loss.backward() #backpropagate the loss

self.optimizer.step() #update the model parameters

self.soft_update(self.local_qnetwork,self.target_qnetwork,interpolation_parameter) #softly update the target network parameters

#update the target network parameters

def soft_update(self,local_model,target_model,interpolation_parameter):

#loop through local and target parameters

for target_param, local_param in zip(target_model.parameters(),local_model.parameters()):

#the soft update softly updates the target model parameters using a weighted average of the local and target parameters

target_param.data.copy_(interpolation_parameter*local_param.data + (1.0-interpolation_parameter)*target_param.data)
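In equation form, the soft update is: theta_target <- interpolation_parameter * theta_local + (1 - interpolation_parameter) * theta_target. With interpolation_parameter = 0.001, the target network only moves 0.1% of the way toward the local network at each update, which keeps the targets stable.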

Initializing the DQN agent

agent= Agent(state_size,number_actions)

Training the DQN agent

number_episodes= 2000 #number of episodes ; which is actually the maximum number of episodes over which we want to train our agent.

#the maximum number of times steps per episode

max_number_timesteps_per_episode = 1000 #In any attempt on landing on the moon, there's gonna be maximum 1000 times steps.

#Reduce epsilon little by little from 1.0 down to 0.01, so the agent explores a lot at first and exploits more and more as it learns

epsilon_starting_value=1.0

epsilon_ending_value= 0.01

epsilon_decay_value = 0.995 # it will help decaying epsilon . for example 1* 0.995= 0.995
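So after n episodes, epsilon is max(0.01, 0.995^n): roughly 0.61 after 100 episodes, 0.37 after 200, and it reaches the 0.01 floor after about 920 episodes.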

epsilon= epsilon_starting_value

#window of scores on 100 episodes

scores_on_100_episodes= deque(maxlen=100)# double-ended queue

#main

for episode in range(1,number_episodes+1): #from first episode to last

#reset the environment

state, _ = env.reset() #reset the environment to its initial state; state gets the initial state and _ gets some other info which is not needed

#initialize score

score=0

#loop over timesteps

for t in range(max_number_timesteps_per_episode):

#select an action

action= agent.act(state,epsilon)

#once it takes an action, it moves to a new state, get rewards etc

next_state,reward,done,_,_= env.step(action)

#training

agent.step(state,action,reward,next_state,done)

#now change the state to new

state=next_state

#update score

score+=reward

# if the episode is done at this specific time step, we simply break

if done:

break

# append the score of that finished episode to that window of the scores on 100 episodes

scores_on_100_episodes.append(score)

#reduce epsilon

epsilon= max(epsilon_ending_value,epsilon_decay_value*epsilon)

print('\rEpisode {}\tAverage Score: {:.2f}'.format(episode,np.mean(scores_on_100_episodes)), end="") #print the episode number and the average score over the last 100 episodes; \r plus end="" overwrites the same line to create a dynamic effect instead of printing a new line each time

if episode % 100 == 0: #every 100 episodes, print the average score on a new line

print('\rEpisode {}\tAverage Score: {:.2f}'.format(episode, np.mean(scores_on_100_episodes)))

if np.mean(scores_on_100_episodes) >= 200.0: # if the average scores_on_100_episodes is larger than 200, well time to say we win

print('\nEnvironment solved in {:d} episodes!\tAverage Score: {:.2f}'.format(episode-100, np.mean(scores_on_100_episodes))) #you can keep episode or put episode-100: since the average is taken over 100 episodes, the agent actually started winning around episode minus 100

torch.save(agent.local_qnetwork.state_dict(), 'checkpoint.pth') #save the model parameters

break

Then, finally, we visualize the result (check that part in the code).

Run all of the cells
