Artificial Intelligence : Reinforcement Learning- Deep Q Learning (Part 36)
Deep Q-Learning is the result of combining Q-Learning with an Artificial Neural Network.
The states of the environment are encoded as a vector, which is passed as input to the Neural Network. The Neural Network then tries to predict which action should be played by returning as output a Q-value for each of the possible actions.
Eventually, the best action to play is chosen by either taking the one that has the highest Q-value, or by selecting one at random with a strategy called epsilon-greedy, which is used for exploration.
Check out the Q Learning Blog before we start
So, now we will represent our puzzle in a 2D format, which lets us know the agent's location, the goal's location, and more.
Then we will feed this to a Neural Network
You can see we have the axes X1 and X2, and we feed them as inputs to the neural network.
Just a reminder,
We were at cell (0,1) and the value there was Q(s,a).
Then we moved to (0,2) and the value became R(s,a) + gamma * max Q(s', a').
Both are supposed to describe the same quantity;
the difference is the time at which they are computed. For example, we were at cell (0,1) at 5 pm and at cell (0,2) at 10 pm.
So the gap between the two creates a temporal difference (TD).
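Written in the same notation, the temporal difference is simply the new estimate minus the value we already had:
TD(s,a) = R(s,a) + gamma * max Q(s', a') − Q(s,a)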
So, using Neural Networks, we will get Q1, Q2, Q3, Q4
and we will compare them with the target Q1, target Q2, target Q3, target Q4
The neural network will learn by adapting the weights!
So, here the loss is the squared difference between each target and its predicted Q-value, summed over the actions:
Loss = (target Q1 − Q1)² + (target Q2 − Q2)² + (target Q3 − Q3)² + (target Q4 − Q4)²
Then we take this loss and use backpropagation with stochastic gradient descent to pass it back through the network and update the weights of the synapses, so that the next time a state goes through the network, the weights describe the environment a little better.
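As a rough PyTorch sketch of that idea (the Q-values and targets below are made-up placeholder numbers, not output of the network we build later):
import torch

# made-up example: 4 predicted Q-values and 4 target Q-values
q_values  = torch.tensor([1.2, 0.4, 0.7, 0.1], requires_grad=True)
q_targets = torch.tensor([1.0, 0.5, 0.9, 0.0])

# sum of squared differences, as described above
loss = ((q_targets - q_values) ** 2).sum()

loss.backward()       # backpropagation: compute the gradients
print(q_values.grad)  # the gradients an optimizer (SGD, Adam, ...) would use to update the weights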
Then using the softmax function, we will get the best one
Experience Replay
So, assume this is a road for our self driving car and the white line is our car.
So, as the car moves, most things stay the same. What things? For example, if you are driving through a desert, no matter how far you go you keep seeing sand and no greenery.
Similarly, while driving along a road your GPS location keeps changing, but most of the time you keep seeing building after building. They are not exactly the same buildings, but they are the same type of thing; a monster or a beach does not suddenly appear.
It's basically road and buildings, over and over, for as long as we are driving.
So, here in the example, it's mostly a straight line to drive.
At this point, something changes, and the Neural Network modifies/updates itself accordingly.
Then it drives on.
So, whatever happened, the model adapted to the change and then the car kept driving.
It encountered changes at various points, adapted, and kept driving. The important thing is that whenever it faced a change, it adapted its weights and kept going, then drove along the same type of road for a while, then faced another change and adapted again. This process repeats until we reach the final goal.
So, we can say that once the memory reaches a certain threshold, the agent decides for itself, "Okay, it's time to learn. I have this batch of experiences and now I'm going to learn from it." It then draws a uniformly distributed random sample, so all experiences are considered equally likely to be picked.
It takes a uniformly distributed sample from that batch of experiences, goes through them, and learns from them. That breaks the bias that comes from the sequential nature of the experiences, which you would get if you pushed them through the network one after the other.
The batch itself keeps being updated, because it works as a rolling window of experiences.
Older experiences get kicked out as newer experiences are added, but each experience stays in the batch for quite some time, so the car/agent can learn from the same experience several times.
So Experience Replay gives you the opportunity to learn from more experiences than if you were learning from one at a time:
because you keep that rolling-window batch, even if your environment gives you limited experience, the Experience Replay approach can help you learn faster.
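A minimal sketch of that rolling-window idea, using a deque with a fixed maximum length and uniform random sampling (the capacity and batch size here are just illustrative):
import random
from collections import deque

buffer = deque(maxlen=5)            # rolling window: the oldest experiences get kicked out automatically

for step in range(8):               # pretend we collected 8 experiences
    buffer.append(('state', 'action', 'reward', step))

print(list(buffer))                 # only the 5 most recent experiences remain
print(random.sample(buffer, k=3))   # a uniformly distributed sample from that window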
Action selection policies
Using the softmax function, we can surely pick the best one; for example, here it is Q2.
But what can happen is that once the Neural Network settles on a result (Q2), it keeps facing a penalty.
For example, assume we got Q2, which means go left.
And since the network assumes it is the best action, we take it about 80% of the time.
But when we take the action, we start getting a penalty, for example -0.04.
Since we take this action 80% of the time, our value decreases by 0.04 again and again.
This adds up to a big loss and the total value keeps decreasing. Now we need to look at the other options, which initially had a smaller chance (10%), and that leads us to change direction.
So you can see that always picking the action with the maximum probability (for example, 80%) might not be a good fit.
Q2 might not be a good decision, so it's important to explore the other options (Q1, Q3, Q4).
To handle this, we can use one of these action selection policies:
-> Epsilon Greedy: With an epsilon-greedy policy, we take the best action most of the time (the 80% one), but we also make sure to explore the other actions that have smaller chances (10%, 10%, etc.).
So for instance, if you set epsilon to 0.1 (10%), then 90% of the time you are still selecting the best action based on the highest Q-value (the one we get after applying the Neural Network), but 10% of the time you are selecting a random action.
If you set epsilon to 0.05, that means 95% of the time the agent takes the action with the highest Q-value, but 5% of the time it still selects a random action.
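A minimal sketch of epsilon-greedy selection (the Q-values here are placeholders, not yet coming from our network):
import random
import numpy as np

def epsilon_greedy(q_values, epsilon=0.05):
    # with probability 1 - epsilon, exploit: take the action with the highest Q-value
    if random.random() > epsilon:
        return int(np.argmax(q_values))
    # with probability epsilon, explore: take a random action
    return random.randrange(len(q_values))

print(epsilon_greedy(np.array([0.1, 0.8, 0.05, 0.05])))  # usually 1, occasionally a random action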
-> Epsilon soft:
Epsilon soft is the opposite: you select an action at random a fraction (1 − epsilon) of the time. So if your epsilon is 0.1 (10%), then only 10% of the time do you take the action we get after applying the Neural Network, and 90% of the time you select a random action.
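Following the description above, an epsilon-soft version simply flips those probabilities (a sketch, reusing the same placeholder Q-values):
import random
import numpy as np

def epsilon_soft(q_values, epsilon=0.1):
    # with probability epsilon, take the action with the highest Q-value; otherwise pick at random
    if random.random() < epsilon:
        return int(np.argmax(q_values))
    return random.randrange(len(q_values))

print(epsilon_soft(np.array([0.1, 0.8, 0.05, 0.05])))  # mostly a random action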
-> Softmax:
After applying softmax, we get a probability for each Q-value, each between zero and one, and together they add up to one.
Since Q2 has the highest probability, we will choose it most often.
We're gonna use these as our distribution and say, "Okay, we're gonna take Q2 90% of the time, but 5% of the time we're still gonna take Q1, 2% of the time Q3, and 3% of the time Q4."
And the beauty here is that as these values update while the agent goes through the environment more and more, it becomes more familiar with it, and therefore the probabilities update too.
Even though Q2 wins here, sometimes (5% of the time, to be precise) we'll still select Q1 as the action to take,
sometimes action 3 (Q3),
and sometimes action 4 (Q4).
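A minimal sketch of softmax action selection, sampling from the probabilities instead of always taking the argmax (again with placeholder Q-values):
import numpy as np

def softmax_action(q_values):
    exp_q = np.exp(q_values - np.max(q_values))  # subtract the max for numerical stability
    probs = exp_q / exp_q.sum()                  # probabilities between 0 and 1 that sum to 1
    return int(np.random.choice(len(q_values), p=probs)), probs

action, probs = softmax_action(np.array([1.0, 3.0, 0.5, 0.8]))
print(action, probs)  # mostly action 1, but the other actions still get picked now and then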
Let's code this down
We are going to land our agent on the Moon using the LunarLander environment from Gymnasium.
Gymnasium is a third-party library that provides ready-made environments (games and simulations) you can install and train your AI on.
Install Gymnasium
!pip install gymnasium
!pip install "gymnasium[atari, accept-rom-license]"
!apt-get install -y swig
!pip install gymnasium[box2d]
Importing the libraries
import os
import random
import numpy as np #to work with mathematics
import torch #to import pytorch
import torch.nn as nn #for neural networks
import torch.optim as optim #to import optimizer
import torch.nn.functional as F #to use functions
import torch.autograd as autograd #for automatic differentiation (used by gradient descent)
from torch.autograd import Variable #for torch variables
from collections import deque, namedtuple #used during the training
Building the AI
If we check the action space, we will see we have 4 actions
Also, the input is going to be an 8-dimensional vector
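You can check those numbers directly from the environment (assuming Gymnasium and Box2D are installed as shown above):
import gymnasium as gym

env = gym.make('LunarLander-v2')
print(env.observation_space.shape)  # (8,) -> the state is an 8-dimensional vector
print(env.action_space.n)           # 4   -> 4 possible actions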
class Network(nn.Module):#creating a class Network
our state_size will be 8 (the observation space) and action_size will be 4; seed=42 just makes the random initialization reproducible
def __init__(self, state_size, action_size, seed=42):
super(Network, self).__init__() #to call the parent class (nn.Module) constructor
self.seed = torch.manual_seed(seed) #to make the random number generation reproducible
Now, we will start building the main part
the first variable I'm creating here is going to be fc1, representing the first full connection between the input layer and the first fully connected layer.
self.fc1 = nn.Linear(input size, number of neurons in the first fully connected layer)
We are just guessing that 64 is a good number of neurons; we can change it later if needed.
self.fc1 = nn.Linear(state_size, 64)
Then the second fully connected layer
self.fc2 = nn.Linear(number of neurons in 1st one, optimal number of neurons for second layer)
self.fc2 = nn.Linear(64, 64)
The third fully connected layer connects to the output. Since we want the network to output one Q-value per possible action (action_size = 4), we use action_size as the number of output neurons.
self.fc3 = nn.Linear(second fully connected layer neurons, output neuron)
self.fc3 = nn.Linear(64, action_size)
Now, to propagate the signal from the state to the output layer, let's build the forward function.
def forward(self, state):
Calling self.fc1(state) passes the state through the first fully connected layer,
so, x = self.fc1(state)
Then from the functional module F we call one of its functions, the relu function, representing, of course, the rectifier activation function.
x = F.relu(self.fc1(state))
This will actually propagate the signal from the input layer to the first fully connected layer with a rectifier activation function.
let's connect first fully connected layer to the second fully connected layer.
so, it takes input from 1st layer and then pass it
x = F.relu(self.fc2(x))
And then we call our third full connection, fc3, which takes as input x, which has by now been passed through the first fully connected layer and then the second, each followed by the rectifier activation function. And there we go.
return self.fc3(x)
That forward propagates the signal from the input layer containing the state to the output layer containing our actions, and we're basically done creating the architecture of our Neural Network.
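Put together, the whole class looks roughly like this (the same pieces as above, assembled in one place):
import torch
import torch.nn as nn
import torch.nn.functional as F

class Network(nn.Module):

    def __init__(self, state_size, action_size, seed=42):
        super(Network, self).__init__()        # call the parent class constructor
        self.seed = torch.manual_seed(seed)    # reproducible weight initialization
        self.fc1 = nn.Linear(state_size, 64)   # input layer -> first fully connected layer
        self.fc2 = nn.Linear(64, 64)           # first -> second fully connected layer
        self.fc3 = nn.Linear(64, action_size)  # second layer -> output, one Q-value per action

    def forward(self, state):
        x = F.relu(self.fc1(state))            # propagate and activate with the rectifier
        x = F.relu(self.fc2(x))
        return self.fc3(x)                     # raw Q-values, one per action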
Train the AI
Importing gymnasium
import gymnasium as gym #importing gymnasium
Now we create the environment:
env = gym.make('LunarLander-v2')
then adding state_shape, state_size and number_actions
state_shape= env.observation_space.shape
state_size= env.observation_space.shape[0] #the number of elements in this input state.
number_actions = env.action_space.n #number of actions
Initializing the hyperparameters
learning rate
learning_rate= 5e-4
Then minibatch_size refers of course to the number of observations used in one step of the training to update the model parameters
minibatch_size = 100
discount factor/gamma
discount_factor = 0.99
memory of the AI
replay_buffer_size= int(1e5)
interpolation parameter used for the training
interpolation_parameter = 0.001
Implementing Experience Replay
Initialize the class and its __init__ method
class ReplayMemory(object):
def __init__(self, capacity): #capacity = maximum size of the memory
Here we check whether a GPU (CUDA) is available and fall back to the CPU otherwise; using the GPU makes the process faster.
self.device=torch.device("cuda" if
torch.cuda.is
_available() else "cpu")
self.capacity = capacity
#capacity variable
self.memory = []
#the list that will store the experiences, each one containing the state, the action, the reward, the next state, and whether we are done or not.
self.position = 0
Now we will create a method that will add those experiences into this replay memory buffer while also checking that we don't exceed the capacity
def push(self, event):
Event is what basically contains the state, the action, the next state, the reward, and that Boolean done saying whether we are done or not
self.memory.append(event)
#append an event
if len(self.memory) > self.capacity:
#make sure it does not exceed the capacity
del self.memory[0]
#delete the oldest event
Then we will randomly select a batch of experiences from the memory buffer using the sample method.
def sample(self, batch_size):
experiences= random.sample(self.memory, k=batch_size)
#we want to sample the experience from self.memory and it's going to be the number of experiences we want to have in the batch which is batch size
And then we extract and stack those elements one by one, starting with the states.
#states
states=torch.from_numpy(np.vstack([e[0] for e in experiences if e is not None])).float().to(self.device)
#e[0] is the state, the first element of each experience; we skip any e that is None; torch.from_numpy converts the stacked array to a PyTorch tensor, .float() makes the values floats, and .to(self.device) moves the tensor to the chosen CPU or GPU
#actions
actions=torch.from_numpy(np.vstack([e[1] for e in experiences if e is not None])).long().to(self.device)
As actions can only be 0, 1, 2 or 3, we can't make them floats; they need to be long integers.
#rewards
rewards=torch.from_numpy(np.vstack([e[2] for e in experiences if e is not None])).float().to(self.device)
# same as states
#next_states
next_states=torch.from_numpy(np.vstack([e[3] for e in experiences if e is not None])).float().to(self.device)
#same as states
#Done
dones=torch.from_numpy(np.vstack([e[4] for e in experiences if e is not None]).astype(np.uint8)).float().to(self.device)
#same as states; .astype(np.uint8) converts the boolean done flags to 0/1 before they are converted to floats
return states,next_states,actions,rewards,dones
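As a quick sanity check, we could push a few dummy experiences into the ReplayMemory we just wrote and sample from it (the random 8-dimensional states here are purely illustrative):
import numpy as np

memory = ReplayMemory(capacity=1000)
for i in range(5):
    state = np.random.rand(8).astype(np.float32)
    next_state = np.random.rand(8).astype(np.float32)
    memory.push((state, i % 4, 1.0, next_state, False))  # (state, action, reward, next_state, done)

states, next_states, actions, rewards, dones = memory.sample(batch_size=3)
print(states.shape, actions.shape)  # torch.Size([3, 8]) torch.Size([3, 1])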
Implementing the DQN class
class Agent(): #creating our agent
def __init__(self, state_size, action_size):
self.device=torch.device("cuda" if
torch.cuda.is
_available() else "cpu")
# this is to use GPU or CPU; make computation faster
self.state_size=state_size
self.action_size=action_size
#Two Q networks
self.local_qnetwork=Network(state_size,action_size).to(self.device)
#creating the local network, to(self.device) to choose CPU or GPU
self.target_qnetwork = Network(state_size, action_size).to(self.device)
#creating the target network
#optimizer
self.optimizer=optim.Adam( self.local_qnetwork.parameters(),lr=learning_rate)
Here, parameters() returns the weights of the network, which are exactly what will be updated step by step to predict better and better actions and land properly on the moon.
self.memory=ReplayMemory(replay_buffer_size)
#creating the memory ; replay_buffer_size is the capacity
#timestep
self.t_step=0
#step counter
Now, the step method: this is the method that stores experiences and decides when to learn from them.
def step(self,state,action,reward,next_state,done):
store the experience in the replay memory
self.memory.push((state,action,reward,next_state,done))
Then we increment the time step counter, self.t_step, which is one of our object variables, and reset it every four steps, so that we learn every four steps.
self.t_step=(self.t_step+1)%4
#We're gonna increment this time step counter and reset it every four steps
Now, check if we have reached a new four steps
if self.t_step==0:
And so if that's the case, then what are we gonna do? Well, we're gonna learn, because we want to learn every four steps. But then remember that when we learn, we don't learn on one observation only.
We actually learn on a minibatch of observations. That's why we created the minibatch variable before, which we initialized to 100.
if len(self.memory.memory)> minibatch_size:
#len(self.memory.memory) is the number of stored experiences: self.memory is the ReplayMemory instance and .memory is its internal list
experiences= self.memory.sample(minibatch_size)
#this will sample 100 experiences from the memory
#learn from experience
self.learn(experiences,discount_factor)
Now the act method, which selects an action based on a given state of the environment.
def act(self,state,epsilon=0.):
#0. to mean float
state= torch.from_numpy(state).float().unsqueeze(0).to(self.device)
#convert the state from a numpy array to a torch tensor, add a batch dimension, and move it to the chosen device
unsqueeze(0) = we need to add an extra dimension which will correspond to the batch, meaning that this extra dimension will say which batch this state belongs to.
#Local network to evaluate
self.local_qnetwork.eval()
#to check we are in the inference mode (predicting mode)
with torch.no_grad():
#any gradient computation is disabled
action_values= self.local_qnetwork(state)
#action_values: which will be of course the actions predicted
#training mode
self.local_qnetwork.train()
#set to training mode
#epsilon greedy action selection policy
if random.random() > epsilon:
#we're gonna select the action with the highest Q value.
return np.argmax(action_values.cpu().data.numpy())
#argmax picks the action with the highest Q-value; we move the tensor to the CPU first because this simple selection doesn't need the GPU
else: #we're gonna select a random action.
return random.choice(np.arange(self.action_size))
The learn method updates the agent's Q-values based on sampled experiences.
def learn(self,experiences,discount_factor):
unpack our sampled experiences into their respective categories. Meaning states, next states, actions, rewards, and dones.
states, next_states, actions, rewards, dones = experiences
Get the maximum predicted Q values (for next states) from target model
next_q_targets = self.target_qnetwork(next_states).detach().max(1)[0].unsqueeze(1)
# detach() removes the tensor from the computation graph; max(1) takes the maximum over dimension 1 (the actions), [0] selects the tensor of maximum Q-values (max also returns the indices), and .unsqueeze(1) adds back the batch dimension so the shape matches the rewards
#Compute Q targets for current states
q_targets = rewards + (discount_factor * next_q_targets * (1 - dones))
#now find expected Q values from local Q network
q_expected= self.local_qnetwork(states).gather(1,actions)
#gather the Q-values of the actions that were actually taken
#compute loss
loss = F.mse_loss(q_expected,q_targets)
#mse= mean squared error loss,
#minimize the loss & backpropagate
self.optimizer.zero_grad()
#reset the gradients of the Adam optimizer to zero with zero_grad()
loss.backward()
#back propagate the loss
self.optimizer.step()
#update the model parameters
self.soft_update(self.local_qnetwork, self.target_qnetwork, interpolation_parameter)
#update the target network parameters
def soft_update(self,local_model,target_model,interpolation_parameter):
#loop through local and target parameters
for target_param, local_param in zip(target_model.parameters(),local_model.parameters()):
#the soft update softly updates the target model parameters using a weighted average of the local and target parameters
target_param.data.copy_(interpolation_parameter * local_param.data + (1.0 - interpolation_parameter) * target_param.data)
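In other words, each soft update nudges the target network a tiny step toward the local network:
θ_target = interpolation_parameter * θ_local + (1 − interpolation_parameter) * θ_target
With interpolation_parameter = 0.001, the target network only moves 0.1% of the way toward the local network at each update, which keeps the training targets stable.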
Initializing the DQN agent
agent= Agent(state_size,number_actions)
Training the DQN agent
number_episodes= 2000
#number of episodes ; which is actually the maximum number of episodes over which we want to train our agent.
#the maximum number of times steps per episode
max_number_timesteps_per_episode = 1000
#In any attempt on landing on the moon, there's gonna be maximum 1000 times steps.
#Reduce epsilon little by little from 1.0 down to 0.01, so the agent explores a lot at first and then exploits more and more
epsilon_starting_value=1.0
epsilon_ending_value= 0.01
epsilon_decay_value = 0.995
# this decays epsilon each episode, for example 1 * 0.995 = 0.995
epsilon= epsilon_starting_value
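A quick back-of-the-envelope check of how fast epsilon decays with these values (just arithmetic, not part of the training code):
import math

# epsilon after n episodes is max(0.01, 1.0 * 0.995 ** n)
# 0.995 ** n = 0.01  =>  n = log(0.01) / log(0.995)
print(math.log(0.01) / math.log(0.995))  # ≈ 918.7, so epsilon reaches its 0.01 floor after about 919 episodes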
#window of scores on 100 episodes
scores_on_100_episodes= deque(maxlen=100)
# double-ended queue
#main
for episode in range(1,number_episodes+1):
#from first episode to last
#reset the environment
state, _ = env.reset()
#reset the environment to its initial state; state gets the initial state and _ gets some extra info which we don't need
#initialize score
score=0
#loop over timesteps
for t in range(max_number_timesteps_per_episode):
#select an action
action= agent.act(state,epsilon)
#once it takes an action, it moves to a new state, gets a reward, etc.
next_state,reward,done,_,_= env.step(action)
#training
agent.step(state,action,reward,next_state,done)
#now change the state to new
state=next_state
#update score
score+=reward
# if the episode is done at this specific time step, well we'll simply do a break,
if done:
break
# append the score of that finished episode to that window of the scores on 100 episodes
scores_on_100_episodes.append(score)
#reduce epsilon
epsilon= max(epsilon_ending_value,epsilon_decay_value*epsilon)
print('\rEpisode {}\tAverage Score: {:.2f}'.format(episode,np.mean(scores_on_100_episodes)), end="")
# prints the episode number and the average score np.mean(scores_on_100_episodes); \r creates a dynamic overwrite effect and end="" makes sure we don't go to a new line
if episode % 100 == 0:
# this runs every 100 episodes, so we keep a permanent line every 100 episodes
print('\rEpisode {}\tAverage Score: {:.2f}'.format(episode, np.mean(scores_on_100_episodes)))
if np.mean(scores_on_100_episodes) >= 200.0:
# if the average of scores_on_100_episodes is at least 200, it's time to say we win
print('\nEnvironment solved in {:d} episodes!\tAverage Score: {:.2f}'.format(episode-100, np.mean(scores_on_100_episodes)))
# you can keep episode or use episode-100: since the average is taken over 100 episodes, the agent actually started winning about 100 episodes earlier
torch.save(agent.local_qnetwork.state_dict(), 'checkpoint.pth')
#save the model parameters
break
Then finally, we visualize the result (Check that from the code)
Run all of the cells
Read more
Arthur Juliani, 2016, Simple Reinforcement Learning with Tensorflow (Part 4)
Tom Schaul et al., Google DeepMind, 2016, Prioritized Experience Replay
Michel Tokic, 2010, Adaptive ε-greedy Exploration in Reinforcement Learning Based on Value Differences