Artificial Intelligence: Reinforcement Learning - A3C (Part 38)
The three A's in A3C
Actor Critic
Now, we can reduce the size of the last layer (the red one) to make room for another output.
So, we now have two outputs at the end.
The top part holds the Q values and is called the Actor, as it has the leading role. All of these Q values together, as we know, also define the policy.
The second part outputs a single number, the value of the state V(s).
And that second part is called the Critic.
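To make that concrete, here is a minimal sketch of the two-headed idea. The layer sizes here are arbitrary assumptions for illustration, not the ones we use later in the Kung Fu network.

import torch
import torch.nn as nn

# a shared body feeding two heads: the actor outputs one Q value per action
# (together they define the policy), the critic outputs a single value V(s)
shared = nn.Linear(64, 128)
actor_head = nn.Linear(128, 4)     # one Q value per action
critic_head = nn.Linear(128, 1)    # one state value V(s)

features = torch.relu(shared(torch.randn(1, 64)))
q_values, state_value = actor_head(features), critic_head(features)
print(q_values.shape, state_value.shape)   # torch.Size([1, 4]) torch.Size([1, 1])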
Asynchronous
So, we started with a puzzle and it had states. Depending on state and reward we took actions, again and again.
So, now instead of 1 agent, we can have 3 agents at the same time who can start from different states and attack the same puzzle.
All of a sudden you are getting triple the amount of experience. Instead of one agent exploring the environment and trying to understand how to operate in it, you now have three (or however many) doing so at the same time, and each of them learns from this bigger pool of experience.
So, if we think of the computation for three agents, it would look like this:
But so far they are not sharing their knowledge. To solve this, we pool all of the learning in the critic: every time, all of these agents contribute to the same critic. They don't have separate critics; they have a common one.
The critic part is shared between the agents, and that is how they pass information to each other.
So, finally, this is going to be the architecture
Advantage
Assume one of the agents is playing and we obtained a Q value and a V value.
Now we can subtract them to get the advantage, A(s, a) = Q(s, a) - V(s), and the advantage is used in the calculation of the policy loss.
So the whole A3C algorithm is asking: the critic knows the value V of the state, so how much better is the Q value of the action you selected compared to that known V value?
Now, since all the agents share the critic's value, the model backpropagates and updates the weights accordingly.
So, you can see one single prediction impacted the weights and the model is learning.
We do the same for the other cases as well.
To sum up: if we selected an action and the advantage turned out to be very low, the network is updated so that, next time, the Q value of that action will be lower and maybe some other action's will be higher. That is how it plays out.
So this whole policy loss nudges the network to do more of the good actions and less of the bad ones.
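As a tiny illustration of that (the numbers below are made up for the example), the advantage and the policy loss boil down to something like this:

import torch

q_value = torch.tensor(2.5)        # Q(s, a): expected return of the chosen action (made-up number)
state_value = torch.tensor(1.8)    # V(s): the critic's estimate of the state (made-up number)
log_prob = torch.tensor(-0.4)      # log-probability of the chosen action under the policy

advantage = q_value - state_value              # A(s, a) = Q(s, a) - V(s)
policy_loss = -log_prob * advantage.detach()   # low advantage -> that action gets discouraged
print(advantage.item(), policy_loss.item())    # roughly 0.7 and 0.28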
LSTM (Long Short Term Memory)
We will add one LSTM layer before the final layer
x_t, a vector of values, goes into our LSTM and, as an output, you get another vector, the hidden state h_t; that hidden state is also fed back into the LSTM at the next time step, which is how the network ties the current input to what it has seen before.
If we zoom in on it, it looks like this.
We have already discussed this in our blog post on RNNs.
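As a rough sketch of that idea (the sizes here are arbitrary, and note that the simple network we actually code below does not include this layer):

import torch
import torch.nn as nn

# an LSTM cell takes the current feature vector x_t together with its previous
# hidden state and returns an updated hidden state h_t that summarizes the past
lstm_cell = nn.LSTMCell(input_size=128, hidden_size=256)
h_t, c_t = torch.zeros(1, 256), torch.zeros(1, 256)   # the memory starts empty
for _ in range(4):                                     # feed a few consecutive frames
    x_t = torch.randn(1, 128)                          # features of the current frame
    h_t, c_t = lstm_cell(x_t, (h_t, c_t))              # h_t now also depends on earlier frames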
Why do we need memory in our A3C or other algorithms?
Imagine this Breakout game.
You see this picture. What do you extract from it? What would your action be?
You can see the ball is in the air, so it is going somewhere, and maybe it is flying towards you.
Could you really draw that conclusion? Could you anticipate that it is coming towards you?
You probably could, and maybe you are in the right spot to catch the ball. But what if the ball was not moving towards you?
From a single frame you don't know its previous movements, so you can't actually tell where it is heading.
Let's code this up
Today we will solve the KungFuMaster problem.
We have 14 actions here,
which is a lot to compute. So, we will make things easier by working with this variant.
Make sure to check the DQN and DCQN code before going through this one.
Install Gymnasium
!pip install gymnasium
!pip install "gymnasium[atari, accept-rom-license]"
!apt-get install -y swig
!pip install gymnasium[box2d]
Importing the libraries
import cv2
import math
import random
import numpy as np
import torch
import torch.nn as nn
import torch.optim as optim
import torch.nn.functional as F
import torch.multiprocessing as mp
import torch.distributions as distributions
from torch.distributions import Categorical
import gymnasium as gym
from gymnasium import ObservationWrapper
from gymnasium.spaces import Box
Creating the architecture of the Neural Network
class Network(nn.Module):
    # our network class inherits from nn.Module
    def __init__(self, actions_size):
        # for images we don't take a state_size
        super(Network, self).__init__()
        # activate the inheritance from nn.Module
        # Note: this Kung Fu problem also gives us images to process, so we use convolutional layers
        self.conv1 = nn.Conv2d(4, 32, kernel_size=(3, 3), stride=2)
        # input channels = 4 because in this A3C model we feed a stack of four grayscale frames from our Kung Fu environment; output channels = 32 (a reasonable starting point); kernel size and stride are based on experiments
        # in DCQN, increasing the number of channels layer by layer worked well, but here in A3C we keep it constant (not a rule of thumb, just an experimental result for this game)
        self.conv2 = nn.Conv2d(32, 32, kernel_size=(3, 3), stride=2)
        # 32 input channels from conv1, 32 output channels
        self.conv3 = nn.Conv2d(32, 32, kernel_size=(3, 3), stride=2)
        # 32 input channels from conv2, 32 output channels
        self.flatten = torch.nn.Flatten()
        # flattening the convolutional output
        # full connections
        self.fc1 = torch.nn.Linear(512, 128)
        # input = number of values coming out of the flatten layer (512 for this stack of convolutions; see the sanity check below), output = 128 (rule of thumb)
        self.fc2a = torch.nn.Linear(128, actions_size)
        # the output of this full connection fc2a is the action values, i.e. the Q values for each action representing the expected return; this is the actor
        self.fc2c = torch.nn.Linear(128, 1)
        # final output layer containing the single state value; this is the critic
Forward method
    def forward(self, x):
        x = F.relu(self.conv1(x))
        # forward propagate the state through conv1, then activate with ReLU
        x = F.relu(self.conv2(x))
        # pass x through conv2 and activate
        x = F.relu(self.conv3(x))
        # pass x through conv3 and activate
        x = self.flatten(x)
        # then forward propagate x through the flatten layer; no activation
        # connecting to the full connection layer
        x = F.relu(self.fc1(x))
        # forward propagate x through fc1 and activate
        # connecting to the actor
        action_values = self.fc2a(x)
        # connecting to the critic
        state_value = self.fc2c(x)[0]
        # we take element [0] to extract the value itself
        return action_values, state_value
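As a quick sanity check of the 512 in fc1, we can push a dummy tensor through the network. This assumes the preprocessed observations are stacks of four 42x42 grayscale frames (an assumption on our part, but it is exactly what makes the flattened size come out to 512):

net = Network(14)                               # 14 actions, as mentioned above
dummy = torch.zeros(1, 4, 42, 42)               # a batch of one stack of four 42x42 frames
action_values, state_value = net(dummy)
print(action_values.shape, state_value.shape)   # torch.Size([1, 14]) torch.Size([1])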
Setting up the environment
We won't explain it line by line, but to give you an idea: we define the properties of the environment, reset the environment, update the buffer, create the environment variables, create the environment, and then get state_shape and the number of actions.
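The full version is in the repo, but as a hedged sketch of the idea, a preprocessing wrapper plus a make_env() helper might look roughly like this. The class name PreprocessAtari, the 42x42 frame size and the environment id are our assumptions for illustration, not necessarily what the repo uses:

class PreprocessAtari(ObservationWrapper):
    # grayscale, resize and keep a rolling stack of the last four frames
    def __init__(self, env, height=42, width=42, n_frames=4):
        super().__init__(env)
        self.img_size = (height, width)
        self.frames = np.zeros((n_frames, height, width), dtype=np.float32)
        self.observation_space = Box(0.0, 1.0, (n_frames, height, width))

    def reset(self, **kwargs):
        obs, info = self.env.reset(**kwargs)
        self.frames = np.zeros_like(self.frames)        # clear the buffer at the start of an episode
        return self.observation(obs), info

    def observation(self, img):
        img = cv2.resize(img, self.img_size)            # shrink the raw frame
        img = img.mean(-1).astype(np.float32) / 255.0   # grayscale and scale to [0, 1]
        self.frames = np.roll(self.frames, shift=-1, axis=0)
        self.frames[-1] = img                           # append the newest frame to the stack
        return self.frames

def make_env():
    # the environment id is an assumption; use whichever Kung Fu Master variant you picked above
    env = gym.make("KungFuMasterDeterministic-v0", render_mode="rgb_array")
    return PreprocessAtari(env)

env = make_env()
state_shape = env.observation_space.shape
number_actions = env.action_space.n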
Initializing the hyperparameters
learning_rate=1e-4
discount_factor=0.99
number_environments=10
#we actually train multiple agents in multiple environments in parallel. And that's the super brilliant part of this algorithm
Implementing the A3C class
class Agent():
    def __init__(self, action_size):
        # same signature as in DCQN
Set which device to choose (GPU or CPU)
self.device=torch.device("cuda" if
torch.cuda.is
_available() else "cpu")
Action size
        self.action_size = action_size
        # we have just one network here
        self.network = Network(action_size).to(self.device)
        # create the single network and move it to the chosen device
Then use the optimizer
        self.optimizer = optim.Adam(self.network.parameters(), lr=learning_rate)
Then we define act() and step(). [Note: whatever we previously did in learn() in DQN and DCQN is now folded into step().]
Make sure to check the explained code in the GitHub repo for step() and act().
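Roughly, act() samples actions from the softmax over the actor's outputs, and step() computes the advantage and then the actor and critic losses. Here is a simplified sketch of that idea (the entropy coefficient of 0.001 and some of the shape handling are assumptions; the repo has the exact version):

    def act(self, state):
        # add a batch dimension if we receive a single state
        if state.ndim == 3:
            state = [state]
        state = torch.tensor(np.array(state), dtype=torch.float32, device=self.device)
        action_values, _ = self.network(state)
        policy = F.softmax(action_values, dim=-1)
        # sample one action per environment from the current policy
        return Categorical(policy).sample().cpu().numpy()

    def step(self, state, action, reward, next_state, done):
        # everything arrives as a batch, one entry per parallel environment
        state = torch.tensor(state, dtype=torch.float32, device=self.device)
        next_state = torch.tensor(next_state, dtype=torch.float32, device=self.device)
        reward = torch.tensor(reward, dtype=torch.float32, device=self.device)
        done = torch.tensor(done, dtype=torch.float32, device=self.device)
        action_values, state_value = self.network(state)
        _, next_state_value = self.network(next_state)
        # TD target for the critic, advantage for the actor
        target = reward + discount_factor * next_state_value * (1.0 - done)
        advantage = target - state_value
        probs = F.softmax(action_values, dim=-1)
        logprobs = F.log_softmax(action_values, dim=-1)
        logp_actions = logprobs[np.arange(len(action)), action]
        entropy = -(probs * logprobs).sum(-1)                       # keeps the policy exploratory
        actor_loss = -(logp_actions * advantage.detach()).mean() - 0.001 * entropy.mean()
        critic_loss = F.mse_loss(state_value, target.detach())
        self.optimizer.zero_grad()
        (actor_loss + critic_loss).backward()
        self.optimizer.step()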
Initializing the A3C agent
agent=Agent(number_actions)
Evaluating our A3C agent on a single episode
def evaluate(agent, env, number_episodes=1):
    # takes the agent, the environment and the number of episodes
    # returns a list of rewards, one per episode
    episodes_rewards = []
    for _ in range(number_episodes):
        state, _ = env.reset()
        # reset() returns the initialized state
        total_reward = 0
        # we break out of this loop once the episode is done
        while True:
            # play an action
            action = agent.act(state)
            # get the next state, the reward and whether the episode is over
            state, reward, done, truncated, info = env.step(action[0])
            total_reward += reward
            # check if we are done with the episode
            if done:
                break
        episodes_rewards.append(total_reward)
    return episodes_rewards
Testing multiple agents on multiple environments at the same time
class EnvBatch:
    # this class lets us handle multiple environments simultaneously, for example by stepping and resetting them together, so that we can use the asynchronous feature of the A3C algorithm

    def __init__(self, n_envs=10):
        # create 10 environments to run in parallel
        self.envs = [make_env() for _ in range(n_envs)]
        # we create the environments using the make_env() defined earlier

    def reset(self):
        # reset all environments
        _states = []
        for env in self.envs:
            _states.append(env.reset()[0])
            # only keep the initialized state
        return np.array(_states)
        # return them as a NumPy array

    def step(self, actions):
        # step in all environments at once; we want the multiple agents to step in these multiple environments after playing their multiple actions
        next_states, rewards, dones, truncateds, infos = map(np.array, zip(*[env.step(a) for env, a in zip(self.envs, actions)]))
        # we take each env from the list of environments and apply its step method; we need one action per environment, so we zip the environments with the actions to loop over both at the same time
        # zip(*) groups the results by position, and map(np.array, ...) converts the groups of next states, rewards, dones, etc. into NumPy arrays
        for i in range(len(self.envs)):
            if dones[i]:
                # if the done boolean of the environment of index i is True,
                next_states[i] = self.envs[i].reset()[0]
                # reset the next state of that particular environment of index i
        return next_states, rewards, dones, infos
Training the A3C agent
import tqdm
env_batch = EnvBatch(number_environments)
batch_states = env_batch.reset()
#reset all of the states
#tqdm is a progress bar
with tqdm.trange(0, 3001) as progress_bar:
    for i in progress_bar:
        batch_actions = agent.act(batch_states)
        # we play one action in each environment of the batch of states
        # after playing the actions we reach the next states, get the rewards and know whether each episode is done or not
        batch_next_states, batch_rewards, batch_dones, _ = env_batch.step(batch_actions)
        # it is common practice in reinforcement learning to scale the rewards down to stabilize training; the "Kung Fu Master" environment produces quite large rewards, so we reduce their magnitude
        batch_rewards *= 0.01
        # one learning step of the agent on this batch of transitions
        agent.step(batch_states, batch_actions, batch_rewards, batch_next_states, batch_dones)
        batch_states = batch_next_states
        # update the batch of states
        if i % 1000 == 0:
            # every 1000 iterations, print the average score over 10 evaluation episodes
            print("Average agent reward: ", np.mean(evaluate(agent, env, number_episodes=10)))
Finally, run and visualize the result
You can see the result like this
This should be the visualization
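If you want to produce such a recording yourself, one possible way is sketched below. It assumes the environment was created with render_mode="rgb_array" and that imageio (with ffmpeg support) is installed; the repo may do this differently, and the function name here is ours.

import imageio

def show_video_of_model(agent, env):
    # roll out one episode with the trained agent and save the rendered frames as a video
    state, _ = env.reset()
    frames = []
    done = False
    while not done:
        frames.append(env.render())
        action = agent.act(state)
        state, reward, done, truncated, _ = env.step(action[0])
        done = done or truncated
    env.close()
    imageio.mimsave('video.mp4', frames, fps=30)

show_video_of_model(agent, env)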
That's it!!
Read more
Volodymyr Mnih et al., 2016, Asynchronous Methods for Deep Reinforcement Learning
Jaromír Janisch, 2017, Let's Make an A3C: Implementation
John Schulman et al., 2016, High-Dimensional Continuous Control Using Generalized Advantage Estimation
Arthur Juliani, 2016, Simple Reinforcement Learning with TensorFlow (Part 8)