Reinforcement Learning Basics: Advantage Actor-Critic Policy Gradient Method (A2C)

We've seen how REINFORCE, a Monte Carlo Policy Gradient method, updates policies by maximizing expected return using complete episode rollouts. While simple and elegant, REINFORCE suffers from high variance and slow convergence.
This blog introduces the Advantage Actor-Critic (A2C) algorithm, a synchronous and simplified variant of the Actor-Critic family. It combines the benefits of policy gradients and value function approximation, leading to faster and more stable learning.

1. What is the Advantage Actor-Critic Method?

The main premise of this method is to split the network into two parts:

  • Policy or Actor Network, which tells us what to do. It outputs a probability distribution over actions.
  • Value or Critic Network, which tells us how good our actions were. It estimates the state-value function, the expected return from a given state under the current policy.
The Actor or Policy loss is a log loss weighted by the advantage:

    L_actor = -log π_θ(a_t | s_t) · A(s_t, a_t)

where:
π_θ(a_t | s_t) is the current policy's probability of taking action a_t in state s_t.
A(s_t, a_t) is the advantage, which can be written as Q(s_t, a_t) - V(s_t).

The Critic or Value loss is a simple Mean Squared Error loss:

    L_critic = (R_t - V_w(s_t))^2

where:
V_w(s_t) is the critic's estimated value for state s_t.
R_t is the estimated return, or Q-value approximation.
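
To make the two terms concrete, here is a minimal PyTorch-style sketch of both losses for a single transition (the function and variable names are my own, not necessarily those used in the accompanying code):

    import torch

    def a2c_losses(logits, value, action, R):
        # logits: actor-head output for s_t (one logit per action)
        # value:  critic-head output V_w(s_t); action: a_t; R: estimated return R_t
        log_probs = torch.log_softmax(logits, dim=-1)
        advantage = R - value.detach()               # A(s_t, a_t) = R_t - V_w(s_t)
        actor_loss = -log_probs[action] * advantage  # policy-gradient (log) loss
        critic_loss = (R - value).pow(2)             # squared-error value loss
        return actor_loss, critic_loss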

2. Exploration Problem

Even with the policy represented as a probability distribution, there is a high chance that the agent will converge to a locally optimal policy and stop exploring. To solve this, we add an entropy term to the existing actor and critic losses:

    H(π) = -Σ_a π(a|s) log π(a|s)

where:
π(a|s) is the probability of taking action a in state s under the current policy.

Entropy is a measure of uncertainty in the policy's action distribution. A high-entropy policy implies the agent is exploring multiple actions, while low entropy means it is confident in one choice.
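
A small sketch of how this entropy term can be computed from the actor's output and folded into the total loss (the 0.01 coefficient is an assumed, typical value rather than the repository's exact setting):

    import torch

    def entropy_bonus(logits):
        # H(pi) = -sum_a pi(a|s) * log pi(a|s) for one state
        probs = torch.softmax(logits, dim=-1)
        log_probs = torch.log_softmax(logits, dim=-1)
        return -(probs * log_probs).sum(dim=-1)

    # total_loss = actor_loss + critic_loss - 0.01 * entropy_bonus(logits)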

3. The Cartpole Environment

The CartPole-v1 environment is a classic control problem widely used in reinforcement learning research and tutorials. The objective is to balance a pole on a moving cart by applying left or right forces. 

The state space consists of four continuous variables: cart position, cart velocity, pole angle, and pole angular velocity. 

The episode terminates when the pole falls beyond a certain angle, or the cart moves out of bounds. A reward of +1 is given for every time step the pole remains balanced, encouraging the agent to keep the pole upright for as long as possible. 
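
For reference, a bare interaction loop with this environment looks like the following using Gymnasium (the maintained successor of OpenAI Gym; the repository may use the older gym API):

    import gymnasium as gym

    env = gym.make("CartPole-v1")
    obs, info = env.reset(seed=0)           # obs = [position, velocity, angle, angular velocity]
    done = False
    while not done:
        action = env.action_space.sample()  # 0 = push cart left, 1 = push cart right
        obs, reward, terminated, truncated, info = env.step(action)  # reward = +1 per step
        done = terminated or truncated
    env.close()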

4. The Training Loop

  • Collect episodes using the current actor policy.
  • Compute discounted returns (used as Q-value estimations).
  • Calculate advantages using A(s_t, a_t) = Q(s_t, a_t) - V(s_t).
  • Update the critic via the MSE loss between predicted and actual returns.
  • Update the actor via the policy-gradient loss weighted by the advantage.
  • Add entropy regularization to encourage exploration.
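
As a quick illustration of steps 2 and 3, here is how discounted returns and advantages come out for a toy three-step episode (gamma = 0.99 is an assumed discount factor):

    # Toy episode: reward +1 per step, discount gamma = 0.99
    gamma = 0.99
    rewards = [1.0, 1.0, 1.0]
    returns, R = [], 0.0
    for r in reversed(rewards):      # R_t = r_t + gamma * R_{t+1}
        R = r + gamma * R
        returns.append(R)
    returns.reverse()                # -> [2.9701, 1.99, 1.0]
    # advantage at each step: A(s_t, a_t) = R_t - V(s_t), with V(s_t) from the critic
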
5. Implementation

The main highlights of the code are:

  1. The Network: The network has a common linear layer as its body and branches out into two heads: a policy (actor) head and a value (critic) head, as sketched below.
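
A minimal sketch of such a two-headed network in PyTorch (the single-layer body and hidden size of 128 are assumptions; the repository's architecture may differ):

    import torch.nn as nn

    class ActorCritic(nn.Module):
        """Shared linear body with separate policy (actor) and value (critic) heads."""
        def __init__(self, obs_dim=4, n_actions=2, hidden=128):
            super().__init__()
            self.body = nn.Sequential(nn.Linear(obs_dim, hidden), nn.ReLU())
            self.policy_head = nn.Linear(hidden, n_actions)  # action logits
            self.value_head = nn.Linear(hidden, 1)           # state value V(s)

        def forward(self, x):
            h = self.body(x)
            return self.policy_head(h), self.value_head(h)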

  2. Play: A batch of 4 episodes is collected with the current policy, and their reward estimates or Q-values are calculated roughly as sketched below.
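
A hedged sketch of this collection step, reusing the ActorCritic module and Gymnasium environment from above (play_batch and discounted_returns are my own names, not necessarily the repository's):

    import torch

    def discounted_returns(rewards, gamma=0.99):
        # Walk the episode backwards: R_t = r_t + gamma * R_{t+1}
        returns, R = [], 0.0
        for r in reversed(rewards):
            R = r + gamma * R
            returns.append(R)
        return list(reversed(returns))

    def play_batch(env, model, n_episodes=4, gamma=0.99):
        # Roll out n_episodes with the current policy; keep states, actions and Q-value estimates
        states, actions, q_vals = [], [], []
        for _ in range(n_episodes):
            obs, _ = env.reset()
            rewards, done = [], False
            while not done:
                with torch.no_grad():
                    logits, _ = model(torch.as_tensor(obs, dtype=torch.float32))
                action = torch.distributions.Categorical(logits=logits).sample().item()
                next_obs, reward, terminated, truncated, _ = env.step(action)
                states.append(obs); actions.append(action); rewards.append(reward)
                obs, done = next_obs, terminated or truncated
            q_vals.extend(discounted_returns(rewards, gamma))
        return states, actions, q_vals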

  3. Training: Calculate the losses as sketched below and perform backpropagation until convergence.
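
A corresponding sketch of the update step; the value and entropy coefficients (0.5 and 0.01) are assumed, typical values rather than the repository's exact settings:

    import numpy as np
    import torch
    import torch.nn.functional as F

    def train_step(model, optimizer, states, actions, q_vals,
                   value_coef=0.5, entropy_coef=0.01):
        states = torch.as_tensor(np.asarray(states), dtype=torch.float32)
        actions = torch.as_tensor(actions)
        q_vals = torch.as_tensor(q_vals, dtype=torch.float32)

        logits, values = model(states)
        values = values.squeeze(-1)
        log_probs = torch.log_softmax(logits, dim=-1)
        probs = torch.softmax(logits, dim=-1)

        advantage = q_vals - values.detach()                # A(s_t, a_t) = R_t - V_w(s_t)
        chosen = log_probs[torch.arange(len(actions)), actions]
        actor_loss = -(chosen * advantage).mean()           # policy-gradient loss
        critic_loss = F.mse_loss(values, q_vals)            # value (MSE) loss
        entropy = -(probs * log_probs).sum(dim=-1).mean()   # exploration bonus

        loss = actor_loss + value_coef * critic_loss - entropy_coef * entropy
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
        return loss.item()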

The code can be found in this repository. I have tried my best to make it as clear as possible, so that the theory can be closely followed.

Code: A2C

Training takes around 30 minutes to converge on my GTX 1650 card.

6. Conclusion

The Advantage Actor-Critic (A2C) algorithm represents a major leap from pure policy gradients by combining the strengths of value-based and policy-based reinforcement learning. It's elegant, efficient, and forms the core of many high-performance algorithms used in robotics, games, and simulation-based RL.

If you're comfortable with REINFORCE and DQN, A2C is your gateway to modern deep RL. Thank You and Stay Tuned.


FUTURE BEYOND OUR STAR!


