Reinforcement Learning Basics: Soft Actor Critic Method (SAC)

We’ve covered powerful policy gradient methods like A2C, DDPG, and PPO — but these can struggle with stability, overestimation bias, or poor exploration in complex environments.

Soft Actor-Critic (SAC), introduced by Haarnoja et al. (2018), is a maximum entropy reinforcement learning algorithm designed to overcome these challenges. SAC achieves state-of-the-art performance on many continuous control benchmarks, combining sample efficiency, stability, and robust exploration. In this blog we will look at the theory behind the method and implement it for the HalfCheetah environment from Gymnasium.

2. What is the Soft Actor-Critic (SAC) Method?

The central idea of this method is entropy regularization, which adds a bonus reward at each timestep proportional to the entropy of the policy.
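
Concretely, instead of maximizing expected return alone, the agent maximizes the entropy-regularized objective

$$J(\pi) = \mathbb{E}_{\pi}\left[\sum_{t} r(s_t, a_t) + \alpha\,\mathcal{H}\big(\pi(\cdot \mid s_t)\big)\right],$$

where $\alpha$ (the temperature) controls how much the entropy bonus is worth relative to the reward. This encourages exploration and keeps the policy from collapsing to a deterministic one too early.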

SAC also uses two Q-networks and takes the minimum of their values when forming the Bellman update target. This clipped double-Q trick helps deal with the Q-value overestimation problem during training.

The loss for the critic (Q) networks is:

$$\mathcal{L}_{Q_i} = \mathbb{E}\left[\Big(Q_i(s,a) - \big(r + \gamma\,(\min_{j=1,2} Q'_j(s',a') - \alpha \log \pi(a'\mid s'))\big)\Big)^2\right], \quad a' \sim \pi(\cdot\mid s')$$

where:

$Q'_1, Q'_2$ are the outputs of the two target critic networks,

$\log \pi(a' \mid s')$ is the entropy term, and

$\alpha$ is the entropy control coefficient (temperature).
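
To make the formula concrete, here is a minimal PyTorch sketch of the target computation (names such as `actor`, `critic_1`, `critic_2`, `target_1`, `target_2` and the batch tensors are placeholders, not necessarily what the repository uses):

```python
import torch
import torch.nn.functional as F

with torch.no_grad():
    # Next actions are sampled from the *current* policy, along with their log-probs
    next_actions, next_log_probs = actor.sample(next_states)
    # Element-wise minimum over the two target critics curbs overestimation
    min_q_next = torch.min(target_1(next_states, next_actions),
                           target_2(next_states, next_actions))
    # Entropy-regularized Bellman target (dones masks out terminal transitions)
    q_target = rewards + gamma * (1.0 - dones) * (min_q_next - alpha * next_log_probs)

# Both critics regress onto the same target
critic_loss = F.mse_loss(critic_1(states, actions), q_target) + \
              F.mse_loss(critic_2(states, actions), q_target)
```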

The actor loss can be written as:

$$\mathcal{L}_{\pi} = \mathbb{E}\left[\alpha \log \pi(\tilde{a}\mid s) - \min_{j=1,2} Q_j(s,\tilde{a})\right], \quad \tilde{a} \sim \pi(\cdot\mid s)$$
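
A matching sketch of the actor update, again with illustrative names, relies on the reparameterization trick so that gradients flow through the sampled action:

```python
# Reparameterized actions keep the sampling step differentiable
new_actions, log_probs = actor.sample(states)
# Use the current (non-target) critics and again take the minimum
q_new = torch.min(critic_1(states, new_actions),
                  critic_2(states, new_actions))
# Minimizing alpha*log_pi - Q maximizes the entropy-regularized value
actor_loss = (alpha * log_probs - q_new).mean()
```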

3. The Half Cheetah Environment

The HalfCheetah-v5 environment from the Gymnasium (formerly OpenAI Gym) MuJoCo suite is a classic benchmark for continuous control algorithms. It simulates a planar, cheetah-like robot with a 17-dimensional state and a 6-dimensional continuous action space controlling the torques of its joints. The agent's objective is to learn an efficient running gait that maximizes forward velocity along flat terrain. The environment provides dense rewards based on forward speed and penalizes large control torques, making it a challenging task that requires smooth, coordinated control.
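
A quick way to confirm these dimensions, assuming Gymnasium is installed with the MuJoCo extras (`pip install "gymnasium[mujoco]"`):

```python
import gymnasium as gym

env = gym.make("HalfCheetah-v5")
print(env.observation_space.shape)  # (17,)
print(env.action_space.shape)       # (6,), torques bounded in [-1, 1]
```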

4. The Training Loop

  • Initialize the hyperparameters, one actor network, two critic networks, and two target critic networks (the actor has no target network).
  • Play episodes and store in experience buffer.
  • Once the buffer holds enough transitions to start training, sample a batch of states, actions, rewards, next states, and done flags.
  • Calculate the critic loss and perform critic training.
  • Calculate the actor loss and perform actor training.
  • Perform the soft update for the target networks (see the sketch below).
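
The soft update in the last step is typically implemented as Polyak averaging; here is a minimal sketch (the function name and the value of `tau` are illustrative):

```python
def soft_update(target_net, source_net, tau=0.005):
    # Polyak averaging: target <- tau * source + (1 - tau) * target
    for t_param, s_param in zip(target_net.parameters(), source_net.parameters()):
        t_param.data.copy_(tau * s_param.data + (1.0 - tau) * t_param.data)
```
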
5. Implementation

The main highlights of the code are:
  1. The Network: The critic networks are similar to those in the previous methods. The actor network uses a linear output head for the action mean together with a learnable (log) standard deviation parameter; a sketch of this design follows the list.

  2. Critic Training: Calculate the critic loss described above and update both critic networks.

  3. Actor Training: Calculate the actor loss and update the actor network.
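
As a rough illustration of the actor design in point 1, here is a minimal tanh-squashed Gaussian policy (the class name, layer sizes, and details are illustrative and may differ from the repository):

```python
import torch
import torch.nn as nn

class GaussianActor(nn.Module):
    """Tanh-squashed Gaussian policy for continuous actions."""
    def __init__(self, obs_dim, act_dim, hidden=256):
        super().__init__()
        self.body = nn.Sequential(nn.Linear(obs_dim, hidden), nn.ReLU(),
                                  nn.Linear(hidden, hidden), nn.ReLU())
        self.mean_head = nn.Linear(hidden, act_dim)        # linear output for the action mean
        self.log_std = nn.Parameter(torch.zeros(act_dim))  # learnable log standard deviation

    def sample(self, obs):
        mean = self.mean_head(self.body(obs))
        dist = torch.distributions.Normal(mean, self.log_std.exp())
        pre_tanh = dist.rsample()              # reparameterized sample
        action = torch.tanh(pre_tanh)          # squash into [-1, 1]
        # Log-probability with the tanh change-of-variables correction
        log_prob = dist.log_prob(pre_tanh) - torch.log(1 - action.pow(2) + 1e-6)
        return action, log_prob.sum(dim=-1, keepdim=True)
```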

The code can be found in this repository. I have tried my best to make it as clear as possible, so that the theory can be closely followed.

Code: SAC

Training takes around 500 minutes to reach a score of about 1187 on my GTX 1650 card. You can train for longer and tune the hyperparameters to improve the gait. Check out the video of the agent performing.

6. Conclusion

Soft Actor-Critic (SAC) stands out as one of the most robust algorithms for continuous action spaces today. By combining maximum entropy objectives, twin critics, and automatic entropy adjustment, SAC achieves state-of-the-art performance with remarkable stability and sample efficiency, making it the algorithm of choice in many robotics and continuous control applications. Thank You and Stay Tuned.


FUTURE BEYOND OUR STAR!
