Reinforcement Learning Basics: Proximal Policy Optimization (PPO)

Policy gradient methods like REINFORCE and A2C are elegant ways to train agents in reinforcement learning, but they often suffer from high variance, unstable updates, and sensitivity to step sizes. This instability can cause policies to collapse or oscillate wildly.

To address these issues, Schulman et al. introduced Proximal Policy Optimization (PPO) in 2017. PPO has become the go-to algorithm for many RL tasks because it strikes a balance between simplicity, stability, and performance. In this blog we will look at the theory behind the method and use it to solve the BipedalWalker environment from Gym.

1. Trust Region Methods 

In RL, a single bad update to the policy is hard to recover from with subsequent updates. The degraded policy collects poor experience samples, which feed into the following steps and can break the policy entirely.

One way to mitigate this is to reduce the learning rate, but that slows down convergence.

Another way is trust region optimization, which constrains the steps taken during optimization to limit their effect on the policy.

The main idea is to prevent a dramatic policy update by checking the KL (Kullback–Leibler) divergence between the old and the new policy.
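To make this concrete, here is a minimal PyTorch sketch (the policies and numbers are made up for illustration) that measures how far a new diagonal-Gaussian policy has drifted from the old one for a single state:

```python
import torch
from torch.distributions import Normal, kl_divergence

# Hypothetical old and new Gaussian policies over a 4-dimensional continuous action
old_policy = Normal(loc=torch.zeros(4), scale=torch.ones(4))
new_policy = Normal(loc=torch.full((4,), 0.1), scale=torch.ones(4) * 0.9)

# KL divergence per action dimension; sum over dimensions for a diagonal Gaussian
kl = kl_divergence(new_policy, old_policy).sum()
print(kl.item())  # small value -> the new policy stayed close to the old one
```

Trust region methods such as TRPO enforce an explicit bound on exactly this quantity; PPO achieves a similar effect with a much simpler clipped objective, described next.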

2. What is the Proximal Policy Optimization Method?

PPO is a policy gradient algorithm designed to avoid making overly large updates to the policy. Unlike standard policy gradients, PPO introduces a surrogate objective with a clipped probability ratio, which ensures that each policy update stays within a reasonable range — hence the word “proximal.”

The core improvement over the classic A2C method is that PPO uses a new objective, built from three pieces:

PPO objective function: L_CLIP(θ) = E_t[ min( r_t(θ) Â_t, clip(r_t(θ), 1 − 𝜖, 1 + 𝜖) Â_t ) ]

Policy ratio: r_t(θ) = π_θ(a_t | s_t) / π_θ_old(a_t | s_t)

Generalized advantage estimate: Â_t = Σ_{l≥0} (γλ)^l δ_{t+l}, where δ_t = r_t + γ V(s_{t+1}) − V(s_t)


This objective clips the ratio between the new and the old policy to the interval [1 − 𝜖, 1 + 𝜖], so by varying 𝜖 we can limit the size of the update. When λ = 1, the above generalized advantage estimate becomes the value estimate used in the A2C method.
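Below is a minimal PyTorch sketch of the clipped surrogate objective (the function name and the default 𝜖 = 0.2 are illustrative, not taken from the repository):

```python
import torch

def ppo_actor_loss(new_logprob, old_logprob, advantage, eps=0.2):
    """Clipped PPO surrogate objective (returned negated so it can be minimized)."""
    ratio = torch.exp(new_logprob - old_logprob)              # r_t = pi_new / pi_old
    surr1 = ratio * advantage
    surr2 = torch.clamp(ratio, 1.0 - eps, 1.0 + eps) * advantage
    return -torch.min(surr1, surr2).mean()
```

Because the ratio is clipped, samples whose ratio has already left the [1 − 𝜖, 1 + 𝜖] interval in the direction of improvement contribute no further gradient, which is what keeps the update "proximal".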

3. A Different Actor Architecture

The actor architecture in this method is a bit different from what we have seen in previous methods.

Actor Architecture as in Code

The final output linear layer learns the mean of the action distribution. The log-std is a learnable parameter that holds the log of the standard deviation of the action distribution. We do not learn the standard deviation directly because it must be greater than zero; to improve stability, we learn the log value and later convert it to the standard deviation by exponentiating it.
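A minimal PyTorch sketch of such an actor (the layer sizes and activations are illustrative; the exact architecture in the repository may differ):

```python
import torch
import torch.nn as nn


class Actor(nn.Module):
    """Gaussian policy: the network outputs the mean, log-std is a free parameter."""

    def __init__(self, obs_size, act_size, hidden=64):
        super().__init__()
        self.mu = nn.Sequential(
            nn.Linear(obs_size, hidden),
            nn.Tanh(),
            nn.Linear(hidden, hidden),
            nn.Tanh(),
            nn.Linear(hidden, act_size),
            nn.Tanh(),                       # keep the mean inside the [-1, 1] action range
        )
        # Learnable log standard deviation, one value per action dimension
        self.log_std = nn.Parameter(torch.zeros(act_size))

    def forward(self, obs):
        mean = self.mu(obs)
        std = torch.exp(self.log_std)        # exponentiate to guarantee std > 0
        return torch.distributions.Normal(mean, std)
```

Sampling from the returned distribution gives exploratory actions during training, while its mean can serve as the deterministic action during testing.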

4. The Bipedal Walker Environment

The BipedalWalker-v3 environment from OpenAI Gym is a challenging continuous control task where a two-legged robot must learn to walk across rugged, randomly generated terrain. Each episode tests the agent’s ability to balance, coordinate leg movements, and adapt to slopes and gaps. The 24-dimensional state space includes detailed information about hull angle, velocity, leg positions, and ground contacts, while the 4-dimensional continuous action space controls the torques applied to each leg joint.
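The environment is easy to instantiate and inspect with Gym (I am assuming a Gym version that exposes BipedalWalker-v3; older releases shipped v2):

```python
import gym

env = gym.make("BipedalWalker-v3")
print(env.observation_space.shape)                   # (24,) hull angle, velocities, leg state, lidar
print(env.action_space.shape)                        # (4,)  torques for the hip and knee joints
print(env.action_space.low, env.action_space.high)   # actions are bounded in [-1, 1]
```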

5. The Training Loop

  • Initialize the hyperparameters and the networks.
  • Play episodes until the trajectory buffer contains enough samples.
  • Compute the advantages and reference values.
  • For a fixed number of epochs, go through the collected data in mini-batches and perform training.
  • Calculate the critic loss and backpropagate.
  • Calculate the actor loss and backpropagate.
  • Periodically test the agent.
  • Iterate until convergence. A compact sketch of this loop is shown below.
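The sketch below ties these steps together. It is pseudocode-style Python: the helper functions (collect_samples, compute_gae, iterate_minibatches, test_reward) and the hyperparameter names are illustrative placeholders, not the exact names used in the repository, and ppo_actor_loss is the clipped objective sketched earlier.

```python
# Illustrative skeleton of the PPO training loop described above
for iteration in range(MAX_ITERATIONS):
    # 1. Collect experience with the current policy until the buffer is full
    trajectory = collect_samples(env, actor, TRAJECTORY_SIZE)

    # 2. Compute GAE advantages and reference values for the critic
    advantages, ref_values = compute_gae(trajectory, critic, gamma=0.99, lam=0.95)

    # 3. Several epochs of mini-batch updates on the same collected data
    for epoch in range(PPO_EPOCHS):
        for batch in iterate_minibatches(trajectory, advantages, ref_values, BATCH_SIZE):
            # Critic: regress V(s) towards the reference values
            critic_loss = (critic(batch.states) - batch.ref_values).pow(2).mean()
            critic_opt.zero_grad()
            critic_loss.backward()
            critic_opt.step()

            # Actor: clipped surrogate objective with the stored old log-probabilities
            new_logprob = actor(batch.states).log_prob(batch.actions).sum(dim=-1)
            actor_loss = ppo_actor_loss(new_logprob, batch.old_logprobs, batch.advantages)
            actor_opt.zero_grad()
            actor_loss.backward()
            actor_opt.step()

    # 4. Periodically test the deterministic policy and stop once the task is solved
    if iteration % TEST_EVERY == 0 and test_reward(env, actor) > SOLVED_REWARD:
        break
```
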
6. Implementation

The main highlights of the code are:
  1. The Network: The critic network is similar to the ones used in previous methods. The actor network is the one with a linear output for the mean and a learnable log-std parameter.

  2. Advantage and Values: Calculate the advantages and reference values using the formulas mentioned above (see the sketch after this list).

  3. Critic Training: Once the trajectory buffer reaches the required size, start the critic training.

  4. Actor Training: Perform the actor training and reset the trajectory buffer afterwards.
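As referenced in point 2, here is a minimal sketch of the core of such a compute_gae helper, operating on the raw reward, value, and done tensors of one trajectory (a simplified, assumed version; variable names are illustrative):

```python
import torch

def compute_gae(rewards, values, dones, gamma=0.99, lam=0.95):
    """Generalized Advantage Estimation over one collected trajectory.

    rewards, values and dones are 1-D tensors of equal length; values[t] holds
    V(s_t), and we bootstrap with 0 after the last transition of the trajectory.
    """
    advantages = torch.zeros_like(rewards)
    last_adv = 0.0
    for t in reversed(range(len(rewards))):
        next_value = 0.0 if t == len(rewards) - 1 else values[t + 1]
        mask = 1.0 - dones[t].float()                 # stop bootstrapping at episode ends
        delta = rewards[t] + gamma * next_value * mask - values[t]
        last_adv = delta + gamma * lam * mask * last_adv
        advantages[t] = last_adv
    ref_values = advantages + values                  # regression targets for the critic
    return advantages, ref_values
```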

The code can be found in this repository. I have tried my best to make it as clear as possible, so that the theory can be closely followed.

Code: PPO

It takes around 120 minutes to converge on my GTX 1650 card. Check out the video of the agent in action.

7. Conclusion

Proximal Policy Optimization (PPO) is a powerful, practical algorithm that combines the best of policy gradients and trust region ideas, delivering state-of-the-art results with relatively low complexity. Its popularity in research and industry stems from its stability, scalability, and ease of use.

Understanding PPO is essential for moving from basic RL algorithms like REINFORCE and A2C to modern, high-performance methods applied in robotics, games, and real-world control tasks. Thank You and Stay Tuned.


FUTURE BEYOND OUR STAR!
