Artificial Intelligence: Reinforcement Learning - A3C (Part 38)

The three A's in A3C

Actor Critic

Now, we can reduce the size of the last layer (the red one) to make room for another output.

Now, we have 2 outputs at the end.

So, the top part holds the Q values and is called the Actor, as it has the leading role,

and the second part holds a single value, the value of the state, V(s).

All of these Q values together, as we know, are what define the policy.

And the second part is called the Critic.

Asynchronous

So, we started with a puzzle that had states. Depending on the state and the reward, we took actions, again and again.

So now, instead of 1 agent, we can have 3 agents at the same time, starting from different states and attacking the same puzzle.

All of a sudden you're getting triple the amount of experience. Instead of just one agent going through the environment, exploring it and trying to understand how to operate in it, you now have three (or however many) doing that and gathering experience, so each one of them is learning from this bigger pool of experience.

So, if we think of the computation for three agents, it would be

But they are not sharing their knowledge. To solve this, we can pool all of the learning into the critic.

Put another way: every time these agents learn something, they contribute it to the same critic. They don't have separate critics; they have a common critic.

The critic part is shared between the agents, and that is how they share information with each other.
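To make that "many agents, one shared network" idea concrete, here is a minimal toy sketch. This is not the Kung Fu code we build below (there we use batched environments instead of worker processes); the tiny module, the random stand-in states and the dummy loss are assumptions purely for illustration.

import torch
import torch.nn as nn
import torch.multiprocessing as mp

class TinyActorCritic(nn.Module):
    # one body, two heads: the actor (action values / policy) and the critic (V(s))
    def __init__(self, state_size=4, action_size=2):
        super().__init__()
        self.body = nn.Linear(state_size, 16)
        self.actor = nn.Linear(16, action_size)
        self.critic = nn.Linear(16, 1)

    def forward(self, x):
        x = torch.relu(self.body(x))
        return self.actor(x), self.critic(x)

def worker(shared_model, steps=50):
    # each worker has its own optimizer but updates the *shared* parameters,
    # so every agent contributes to the same actor and the same critic
    optimizer = torch.optim.SGD(shared_model.parameters(), lr=1e-3)
    for _ in range(steps):
        state = torch.randn(1, 4)  # stand-in for a real environment state
        action_values, state_value = shared_model(state)
        loss = action_values.pow(2).mean() + state_value.pow(2).mean()  # dummy loss, just to trigger updates
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()

if __name__ == "__main__":
    model = TinyActorCritic()
    model.share_memory()  # the parameters live in shared memory, common to all workers
    workers = [mp.Process(target=worker, args=(model,)) for _ in range(3)]
    for p in workers:
        p.start()
    for p in workers:
        p.join()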

So, finally, this is going to be the architecture

Advantage

Assume one of the agents is playing and we got a Q value and a V value.

Now, we can subtract them to get the advantage, A = Q(s, a) - V(s). The advantage is used in the calculation of the policy loss.

So, the whole A3C algorithm is asking: the critic knows the V value, so how much better is the Q value of the action you're selecting compared to that known V value? That's what the advantage measures.

Now, since all the agents know the critic's value, the model backpropagates and updates the weights.

So, you can see how a single prediction impacted the weights, and the model is learning.

We do this for the other cases as well.

To sum up: if we selected an action and the advantage turned out to be very low, the network gets updated in such a way that, next time, the Q value of that action will be lower and maybe some other action's will be higher. That's how it plays out.

So, basically, this whole policy loss helps the network adapt so that we do more of the good actions and less of the bad ones.
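As a small, hedged illustration of that calculation in PyTorch (the numbers are placeholders and this is not the exact loss we implement later in step()):

import torch
import torch.nn.functional as F

q_values = torch.tensor([1.2, 0.3, -0.5])  # actor head: one Q-style value per action
v_value = torch.tensor(0.9)                # critic head: V(s)
action = 0                                 # the action we actually played

advantage = q_values[action] - v_value                # A = Q(s, a) - V(s) = 0.3 here
log_prob = F.log_softmax(q_values, dim=-1)[action]    # log-probability of that action under the policy
policy_loss = -log_prob * advantage.detach()          # positive advantage -> make this action more likely
print(advantage.item(), policy_loss.item())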

LSTM (Long Short Term Memory)

We will add one LSTM layer before the final layer

Xt, a vector of values, goes into our LSTM, and as an output you get another vector, ht, together with the cell's internal memory that is carried forward and tied back into the network at the next step.

If we zoom in, it looks like this.

We have already discussed this in our blog on RNNs, but here is a quick sketch of how it would slot into our network.
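The Kung Fu network we build below actually skips the LSTM, but as a hedged sketch, an nn.LSTMCell could sit between the fully connected layer and the two heads roughly like this (the 512/128 sizes mirror the network below; everything else is an assumption for illustration):

import torch
import torch.nn as nn

class ActorCriticWithMemory(nn.Module):
    def __init__(self, action_size, feature_size=128):
        super().__init__()
        self.fc1 = nn.Linear(512, feature_size)               # same full connection as in the network below
        self.lstm = nn.LSTMCell(feature_size, feature_size)   # the memory cell, added before the heads
        self.fc2a = nn.Linear(feature_size, action_size)      # actor head
        self.fc2c = nn.Linear(feature_size, 1)                # critic head

    def forward(self, x, hidden=None):
        # x: flattened conv features; hidden: the (h, c) pair carried over from the previous frame
        x = torch.relu(self.fc1(x))
        h, c = self.lstm(x, hidden)                           # hidden=None means "start with an empty memory"
        return self.fc2a(h), self.fc2c(h), (h, c)             # hand (h, c) back so it can be fed in next step

net = ActorCriticWithMemory(action_size=14)
action_values, state_value, hidden = net(torch.zeros(1, 512))  # at the next frame you would pass hidden back in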

Why do we need memory in our A3C or other algorithms?

Imagine in this breakout game,

You see this picture. What do you extract from it? What would your actions be from here?

You can see the ball, right? It's in the air, so it's going somewhere, and maybe it's flying towards you.

Could you make that conclusion? Could you anticipate that it's coming towards you?

You probably could, and maybe you're in the right spot to catch the ball. But what if the ball was not moving towards you at all?

From a single frame you don't know the ball's previous movements, so you can't actually tell where it's heading. That's why the agent needs some memory of what came before, whether that's an LSTM or a stack of recent frames (which is what we'll use below).

Let's code this up.

We will solve the KungFuMaster problem today.

We have 14 actions here

That's a lot to compute, so we will make things easier by working with this variant.
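If you want to see those actions for yourself, you can create the environment and list them. The id "KungFuMasterDeterministic-v0" is an assumption here; depending on your ALE/Gymnasium versions it may be exposed as "ALE/KungFuMaster-v5" instead.

import gymnasium as gym

env = gym.make("KungFuMasterDeterministic-v0", render_mode="rgb_array")  # assumed environment id
print(env.action_space.n)                   # 14 discrete actions
print(env.unwrapped.get_action_meanings())  # NOOP, movement, punches, kicks, ...
env.close()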

Make sure to check the DQN and DCQN code before going through this one.

Install Gymnasium

!pip install gymnasium

!pip install "gymnasium[atari, accept-rom-license]"

!apt-get install -y swig

!pip install gymnasium[box2d]

Importing the libraries

import cv2

import math

import random

import numpy as np

import torch

import torch.nn as nn

import torch.optim as optim

import torch.nn.functional as F

import torch.multiprocessing as mp

import torch.distributions as distributions

from torch.distributions import Categorical

import gymnasium as gym

from gymnasium import ObservationWrapper

from gymnasium.spaces import Box

Creating the architecture of the Neural Network

class Network(nn.Module):  # our network class, inheriting from nn.Module

    def __init__(self, action_size):  # for images we don't take a state_size
        super(Network, self).__init__()  # activate the inheritance from nn.Module
        # Note: this Kung Fu problem also gives us images to process, so we will have convolutional layers
        self.conv1 = nn.Conv2d(4, 32, kernel_size=(3, 3), stride=2)  # input channels = 4 because in this A3C model we feed a stack of four grayscale frames from our Kung Fu environment; output channels = 32 (a good starting point); kernel size and stride are based on experiments. In DCQN, increasing the output channels layer by layer worked well, but here in A3C we keep them constant (not a rule of thumb, just an experimented result for this game)
        self.conv2 = nn.Conv2d(32, 32, kernel_size=(3, 3), stride=2)  # 32 input channels from conv1, 32 output channels
        self.conv3 = nn.Conv2d(32, 32, kernel_size=(3, 3), stride=2)  # 32 input channels from conv2, 32 output channels
        self.flatten = torch.nn.Flatten()  # flattening the convolutional output
        # full connections
        self.fc1 = torch.nn.Linear(512, 128)  # input = number of outputs from the flatten layer (512 for these three conv layers on 42x42 frames), output = 128 (rule of thumb)
        self.fc2a = torch.nn.Linear(128, action_size)  # the output of this full connection will be the action values, i.e. the Q values for each action representing the expected return; this is the actor
        self.fc2c = torch.nn.Linear(128, 1)
        # final output layer containing the state value V(s); this is the critic

Forward method

    def forward(self, x):
        x = F.relu(self.conv1(x))  # forward propagate the state to conv1, then activate with relu
        x = F.relu(self.conv2(x))  # take x to conv2 and activate
        x = F.relu(self.conv3(x))  # take x to conv3 and activate
        x = self.flatten(x)  # then forward propagate x to the flatten layer; no activation
        # connecting to the full connection layer
        x = F.relu(self.fc1(x))  # forward propagate x to fc1 and activate
        # connecting to the actor
        action_values = self.fc2a(x)
        # connecting to the critic
        state_value = self.fc2c(x)[0]  # index [0] pulls the value out of the output of fc2c
        return action_values, state_value
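A quick sanity check of the shapes, assuming the environment wrapper feeds us a stack of four 42x42 grayscale frames (that is exactly what makes the flatten size come out to 512):

network = Network(14)  # Kung Fu Master has 14 actions
dummy_state = torch.zeros(1, 4, 42, 42)  # batch of 1: four stacked 42x42 grayscale frames
action_values, state_value = network(dummy_state)
print(action_values.shape)  # torch.Size([1, 14]) -> one Q value per action (the actor)
print(state_value.shape)    # torch.Size([1])     -> the state value V(s) (the critic)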

Setting up the environment

We won't explain it line by line. But to give you an idea: we will define the properties of the environment (frame size, grayscale, frame stacking), reset the environment, update the frame buffer, create the environment variables, create the environment and then read off state_shape and number_actions. A hedged sketch of what that can look like follows below.
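The 42x42 target size, the 4-frame grayscale stack and the environment id in this sketch are assumptions that match the network above; the repo's version is the one to follow.

class PreprocessAtari(ObservationWrapper):
    # turns raw RGB frames into a rolling stack of n_frames grayscale height x width images in [0, 1]
    def __init__(self, env, height=42, width=42, n_frames=4):
        super().__init__(env)
        self.img_size = (height, width)
        self.observation_space = Box(0.0, 1.0, (n_frames, height, width))
        self.frames = np.zeros((n_frames, height, width), dtype=np.float32)

    def reset(self, **kwargs):
        self.frames = np.zeros_like(self.frames)  # empty the frame buffer at the start of an episode
        obs, info = self.env.reset(**kwargs)
        return self.observation(obs), info

    def observation(self, img):
        img = cv2.resize(img, self.img_size)  # shrink the frame
        img = img.mean(-1).astype(np.float32) / 255.0  # grayscale, scaled to [0, 1]
        self.frames = np.roll(self.frames, shift=-1, axis=0)  # drop the oldest frame
        self.frames[-1] = img  # append the newest frame
        return self.frames

def make_env():
    env = gym.make("KungFuMasterDeterministic-v0", render_mode="rgb_array")  # assumed environment id
    return PreprocessAtari(env)

env = make_env()
state_shape = env.observation_space.shape  # (4, 42, 42)
number_actions = env.action_space.n        # 14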

Initializing the hyperparameters

learning_rate=1e-4

discount_factor=0.99

number_environments=10 #we actually train multiple agents in multiple environments in parallel. And that's the super brilliant part of this algorithm

Implementing the A3C class

class Agent():

    def __init__(self, action_size):  # same as in DCQN

Set which device to use (GPU or CPU)

        self.device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

Action size

        self.action_size = action_size

        # we have just 1 network here
        self.network = Network(action_size).to(self.device)  # new

Then use the optimizer

        self.optimizer = optim.Adam(self.network.parameters(), lr=learning_rate)

Then we will define act() and step(). (Note: whatever we previously did in learn() for DQN and DCQN is now folded into step().)

Make sure to check the explained code for act() and step() in the GitHub repo; a rough sketch is given below.
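In case you want the gist without opening the repo, here is a hedged sketch of what act() and step() typically look like for this kind of A3C agent. The entropy coefficient (0.01) and the exact bookkeeping are assumptions, and the sketch assumes the network returns one state value per input state when given a batch (which may mean using self.fc2c(x)[:, 0] instead of [0] in forward()); the repo's version is the reference.

    def act(self, state):
        # state: one stacked observation (4, 42, 42) or a batch of them (N, 4, 42, 42)
        if state.ndim == 3:
            state = [state]
        state = torch.tensor(np.array(state), dtype=torch.float32, device=self.device)
        action_values, _ = self.network(state)
        policy = F.softmax(action_values, dim=-1)
        # sample one action per state from the softmax policy (the stochastic actor)
        return np.array([np.random.choice(len(p), p=p) for p in policy.detach().cpu().numpy()])

    def step(self, state, action, reward, next_state, done):
        # assumes state_value has one entry per state in the batch (see the note above)
        state = torch.tensor(state, dtype=torch.float32, device=self.device)
        next_state = torch.tensor(next_state, dtype=torch.float32, device=self.device)
        reward = torch.tensor(reward, dtype=torch.float32, device=self.device)
        done = torch.tensor(done, dtype=torch.float32, device=self.device)
        action = torch.tensor(action, dtype=torch.long, device=self.device)
        action_values, state_value = self.network(state)
        _, next_state_value = self.network(next_state)
        target = reward + discount_factor * next_state_value * (1 - done)  # critic target
        advantage = target - state_value  # the "A" of A3C
        probs = F.softmax(action_values, dim=-1)
        log_probs = F.log_softmax(action_values, dim=-1)
        logp_actions = log_probs.gather(1, action.unsqueeze(1)).squeeze(1)  # log-prob of the played actions
        entropy = -(probs * log_probs).sum(dim=-1)  # keeps the policy from collapsing too early
        actor_loss = -(logp_actions * advantage.detach()).mean() - 0.01 * entropy.mean()
        critic_loss = F.mse_loss(state_value, target.detach())
        self.optimizer.zero_grad()
        (actor_loss + critic_loss).backward()
        self.optimizer.step()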

Initializing the A3C agent

agent=Agent(number_actions)

Evaluating our A3C agent on a single episode

def evaluate(agent, env, number_episodes=1):  # agent, environment, number of episodes
    # we will return a list of rewards, one per episode
    episodes_rewards = []
    for _ in range(number_episodes):
        state, _ = env.reset()  # returns the initialized state
        total_reward = 0
        # loop until the episode is done
        while True:
            # play an action
            action = agent.act(state)
            # next state, reward and done (whether we are done or not)
            state, reward, done, truncated, info = env.step(action[0])
            total_reward += reward
            # check if we are done with the episode
            if done:
                break
        episodes_rewards.append(total_reward)
    return episodes_rewards

Testing multiple agents on multiple environments at the same time

class EnvBatch:
    # this class handles multiple environments simultaneously, e.g. by stepping and resetting
    # them all at once, so that we can use the asynchronous side of the A3C algorithm

    def __init__(self, n_envs=10):  # create 10 environments running in parallel
        self.envs = [make_env() for _ in range(n_envs)]  # create multiple environments using the make_env() defined earlier

    def reset(self):  # reset all environments
        _states = []
        for env in self.envs:
            _states.append(env.reset()[0])  # keep only the initialized state, not the info
        return np.array(_states)  # return the batch of states as a numpy array

    def step(self, actions):
        # let the multiple agents step in their multiple environments after playing their multiple actions
        next_states, rewards, dones, truncations, infos = map(np.array, zip(*[env.step(a) for env, a in zip(self.envs, actions)]))
        # we take each env from self.envs and call its step() method, but step() needs one action at a time
        # while actions holds several, so we zip() the environments and the actions to loop over them together.
        # zip(*) then regroups the results, and map(np.array, ...) converts the groups of next states,
        # rewards, dones, truncations and infos into NumPy arrays
        for i in range(len(self.envs)):
            if dones[i]:  # if the done boolean of the environment of index i is True,
                next_states[i] = self.envs[i].reset()[0]  # reset the next_state of that particular environment of index i
        return next_states, rewards, dones, infos

Training the A3C agent

import tqdm

env_batch = EnvBatch(number_environments)

batch_states = env_batch.reset() #reset all of the states

#tqdm is a progress bar

with tqdm.trange(0, 3001) as progress_bar:

    for i in progress_bar:
        batch_actions = agent.act(batch_states)  # play one action in each of the batched states
        # after playing the actions we reach the next states, get the rewards and know whether each episode is done
        batch_next_states, batch_rewards, batch_dones, _ = env_batch.step(batch_actions)
        """It is common practice in reinforcement learning to rescale rewards to stabilize training. The
        "Kung Fu Master" environment produces fairly large rewards, so we reduce their magnitude."""
        batch_rewards = batch_rewards * 0.01  # reduce the magnitude of the batch rewards
        agent.step(batch_states, batch_actions, batch_rewards, batch_next_states, batch_dones)
        batch_states = batch_next_states  # update the batch of states
        if i % 1000 == 0:  # every 1000 iterations, print the average score
            print("Average agent reward: ", np.mean(evaluate(agent, env, number_episodes=10)))

Finally, run and visualize the result
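If you want to render the gameplay yourself, here is a hedged sketch using imageio (assuming imageio and its ffmpeg plugin are available; any video writer would do):

import imageio

def record_episode(agent, env, filename="kungfu.mp4"):
    # play one episode with the trained agent and write the rendered frames to a video file
    state, _ = env.reset()
    frames = []
    done = False
    while not done:
        frames.append(env.render())  # works because the env was created with render_mode="rgb_array"
        action = agent.act(state)
        state, reward, done, _, _ = env.step(action[0])
    env.close()
    imageio.mimsave(filename, frames, fps=30)

record_episode(agent, make_env())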

You can see the result like this

This should be the visualization

That's it!!
