1. What is the Advantage Actor-Critic Method?
The main premise of this method is to split the network into two parts:
- The Policy or Actor network, which tells us what to do. It outputs a probability distribution over actions.
- The Value or Critic network, which tells us how good our actions were. It estimates the state-value function, i.e. the expected return from a given state under the current policy.
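In symbols, the critic learns V(s_t) = E[G_t | s_t], where G_t = r_{t+1} + γ r_{t+2} + γ² r_{t+3} + … is the discounted return and γ is the discount factor; this is the standard definition of the state-value function.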
3. The CartPole Environment
The CartPole-v1 environment is a classic control problem widely used in reinforcement learning research and tutorials. The objective is to balance a pole on a moving cart by applying left or right forces.
The state space consists of four continuous variables: cart position, cart velocity, pole angle, and pole angular velocity.
The episode terminates when the pole falls beyond a certain angle or the cart moves out of bounds (in CartPole-v1, it is also truncated after 500 steps). A reward of +1 is given for every time step the pole remains balanced, encouraging the agent to keep the pole upright for as long as possible.
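To make this concrete, here is a minimal random-agent interaction with the environment. I am assuming the Gymnasium API (`gymnasium` package) here; an implementation built on the older `gym` package would have slightly different `reset`/`step` signatures.

```python
import gymnasium as gym

env = gym.make("CartPole-v1")
obs, info = env.reset(seed=0)
print(obs)  # [cart position, cart velocity, pole angle, pole angular velocity]

total_reward = 0.0
done = False
while not done:
    action = env.action_space.sample()  # random action: 0 = push left, 1 = push right
    obs, reward, terminated, truncated, info = env.step(action)
    total_reward += reward  # +1 for every step the pole stays balanced
    done = terminated or truncated

print(f"Episode return with a random policy: {total_reward}")
env.close()
```

A random policy typically lasts only a couple of dozen steps, which is the baseline the agent has to improve on.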
4. The Training Loop
- Collect episodes using the current actor policy.
- Compute discounted returns (used as Q-value estimations).
- Calculate advantages using A(s_t, a_t) = Q(s_t, a_t) − V(s_t), where the discounted return stands in for Q(s_t, a_t).
- Update the critic via the MSE loss between predicted and actual returns.
- Update the actor via the policy gradient loss weighted by the advantage.
- Add entropy regularization to encourage exploration. (All of these steps are sketched in code below.)
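The following is a minimal sketch of one such update in PyTorch, not the repository's exact code: the hyperparameter values, the `compute_returns` helper, and the assumption that `model` is a two-headed module returning `(logits, values)` (like the one described in the next section) are all mine for illustration.

```python
import torch
import torch.nn.functional as F

GAMMA = 0.99         # discount factor (assumed value)
ENTROPY_COEF = 0.01  # entropy regularization weight (assumed value)

def compute_returns(rewards, gamma=GAMMA):
    """Discounted returns G_t, used as Q-value estimates."""
    returns, g = [], 0.0
    for r in reversed(rewards):
        g = r + gamma * g
        returns.insert(0, g)
    return torch.tensor(returns)

def a2c_update(model, optimizer, states, actions, rewards):
    """One A2C update from a single collected episode.

    states: list of state tensors, actions: list of ints, rewards: list of floats.
    """
    states = torch.stack(states)        # (T, state_dim)
    actions = torch.tensor(actions)     # (T,)
    returns = compute_returns(rewards)  # (T,)

    logits, values = model(states)      # actor head and critic head
    values = values.squeeze(-1)

    # Advantage A(s_t, a_t) = Q(s_t, a_t) - V(s_t), with G_t standing in for Q.
    # Detached so the policy loss does not backpropagate through the critic.
    advantages = returns - values.detach()

    dist = torch.distributions.Categorical(logits=logits)
    log_probs = dist.log_prob(actions)

    actor_loss = -(log_probs * advantages).mean()  # policy gradient loss
    critic_loss = F.mse_loss(values, returns)      # MSE: predicted vs. actual returns
    entropy = dist.entropy().mean()                # higher entropy -> more exploration

    loss = actor_loss + critic_loss - ENTROPY_COEF * entropy
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```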
- The Network: The network has a common linear layer as its body and branches into two heads: a policy (actor) head and a value (critic) head. A sketch of such a module follows.
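Here is one possible implementation of that two-headed architecture; the layer sizes and names are my assumptions, not the repository's exact definitions.

```python
import torch.nn as nn

class ActorCritic(nn.Module):
    """Shared linear body with separate policy (actor) and value (critic) heads."""

    def __init__(self, state_dim=4, n_actions=2, hidden_dim=128):
        super().__init__()
        # Common body shared by both heads
        self.body = nn.Sequential(
            nn.Linear(state_dim, hidden_dim),
            nn.ReLU(),
        )
        self.policy_head = nn.Linear(hidden_dim, n_actions)  # action logits
        self.value_head = nn.Linear(hidden_dim, 1)           # state value V(s)

    def forward(self, x):
        h = self.body(x)
        return self.policy_head(h), self.value_head(h)
```

Sharing the body lets the actor and critic reuse the same features, at the cost of their gradients interacting through the shared layers.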
The code can be found in this repository. I have tried my best to make it as clear as possible, so that the theory can be closely followed.
Training takes around 30 minutes to converge on my GTX 1650 card.
6. Conclusion
The Advantage Actor-Critic (A2C) algorithm represents a major leap from pure policy gradients by combining the strengths of value-based and policy-based reinforcement learning. It's elegant, efficient, and forms the core of many high-performance algorithms used in robotics, games, and simulation-based RL.
If you're comfortable with REINFORCE and DQN, A2C is your gateway to modern deep RL. Thank you, and stay tuned.