Reinforcement Learning Basics: Twin Delayed Deep Deterministic Policy Gradient Method (TD3)

 


Earlier, we explored Deep Deterministic Policy Gradient (DDPG), a foundational algorithm for continuous control. While powerful, DDPG suffers from overestimation bias, poor exploration, and instability, especially in high-dimensional tasks.

Twin Delayed Deep Deterministic Policy Gradient (TD3), introduced by Fujimoto et al. (2018), is a direct successor to DDPG. It adds three critical modifications that dramatically improve stability and performance on continuous control benchmarks, making it one of the go-to algorithms for real-world robotics and control problems. In this blog, we will implement TD3 for the InvertedDoublePendulum MuJoCo environment.

1. How can we convert DDPG to TD3?

TD3 is essentially an improvement over the DDPG algorithm; the following three changes are all that is needed to turn DDPG into TD3.

  • Double Q Network: Similar to what we have seen in SAC, we use two critic networks to estimate the Q-values and take the minimum of the two when forming the target.
  • Delayed Policy Update: The policy/actor and the target networks are updated less frequently than the critics. In the paper, the authors update them once for every two critic updates.
  • Target Policy Smoothing: Noise is added to the target action as well, not only to the action taken by the main policy network. This makes it harder for the policy to exploit Q-function errors by smoothing the value estimate over nearby actions (see the sketch below).
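To make these changes concrete, here is a minimal PyTorch sketch of how the TD3 target value is formed (the first and third points). The names actor_target, critic1_target, and critic2_target are illustrative placeholders, not taken from the repository; the default hyperparameters follow the TD3 paper.

```python
import torch

def td3_target(reward, next_state, done, actor_target, critic1_target, critic2_target,
               gamma=0.99, noise_std=0.2, noise_clip=0.5, max_action=1.0):
    """Clipped double-Q target with target policy smoothing."""
    with torch.no_grad():
        next_action = actor_target(next_state)
        # Target policy smoothing: add clipped Gaussian noise to the target action
        noise = (torch.randn_like(next_action) * noise_std).clamp(-noise_clip, noise_clip)
        next_action = (next_action + noise).clamp(-max_action, max_action)
        # Double Q: use the minimum of the two target critics
        q_next = torch.min(critic1_target(next_state, next_action),
                           critic2_target(next_state, next_action))
        return reward + gamma * (1.0 - done) * q_next
```

The second point, the delayed policy update, lives in the training loop itself: the actor and the target networks are only updated on every second critic update (see Section 3).
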
2. The Inverted Double Pendulum Environment

The InvertedDoublePendulum-v5 environment from the Gymnasium (formerly OpenAI Gym) MuJoCo suite is a classic benchmark for testing continuous control algorithms on underactuated systems. In this environment, an agent must balance a double pendulum upright on a cart moving along a one-dimensional track. The state space is 11-dimensional, capturing the cart’s position and velocity as well as the angles and angular velocities of both pendulum links. The agent’s single continuous action controls the force applied to the cart, and the nonlinear, unstable dynamics make the task highly challenging.
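As a quick sanity check, the environment's spaces can be inspected with a few lines of Gymnasium code. This snippet assumes gymnasium is installed with the MuJoCo extra (pip install "gymnasium[mujoco]") and is independent of the repository:

```python
import gymnasium as gym

env = gym.make("InvertedDoublePendulum-v5")
print(env.observation_space)  # flat Box observation (cart and pendulum states)
print(env.action_space)       # 1-D continuous force applied to the cart

obs, info = env.reset(seed=0)
obs, reward, terminated, truncated, info = env.step(env.action_space.sample())
env.close()
```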

3. The Training Loop

  • Collect an episode of data by interacting with the environment using the current actor plus exploration noise (e.g., zero-mean Gaussian noise; the TD3 paper found that the Ornstein-Uhlenbeck process used in DDPG offers no benefit over uncorrelated Gaussian noise).
  • Store each transition (s, a, r, s', d) in the replay buffer.

  • Once the replay buffer has enough samples (e.g., >10,000), sample a batch.

  • Calculate the critic loss and backpropagate: as in SAC, the TD target uses the minimum of the two target critics, here with target-policy-smoothing noise added to the target action (see the update-step sketch after this list).

  • Calculate the actor loss and backpropagate: as in SAC, the actor is trained to maximize the critic's Q-value of its own action, but TD3's policy is deterministic (no entropy term) and this update is performed only every few critic updates.

  • Periodically test the agent’s performance on the evaluation environment (e.g., every 100 episodes).

  • Save the model if test reward exceeds the best so far or a predefined threshold.
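Putting the middle steps of this loop together, one TD3 gradient step might look like the sketch below. All names (actor, critic1, the optimizers, the batch layout) are assumed placeholders rather than the repository's actual code, and the default hyperparameters are those reported in the TD3 paper.

```python
import torch
import torch.nn.functional as F

def td3_update(step, batch, actor, actor_target, critic1, critic2,
               critic1_target, critic2_target, actor_opt, critic_opt,
               gamma=0.99, tau=0.005, policy_delay=2,
               noise_std=0.2, noise_clip=0.5, max_action=1.0):
    state, action, reward, next_state, done = batch

    # Critic update (every step): clipped double-Q target with smoothing noise
    with torch.no_grad():
        noise = (torch.randn_like(action) * noise_std).clamp(-noise_clip, noise_clip)
        next_action = (actor_target(next_state) + noise).clamp(-max_action, max_action)
        target_q = reward + gamma * (1.0 - done) * torch.min(
            critic1_target(next_state, next_action),
            critic2_target(next_state, next_action))
    critic_loss = (F.mse_loss(critic1(state, action), target_q)
                   + F.mse_loss(critic2(state, action), target_q))
    critic_opt.zero_grad()   # single optimizer over both critics' parameters
    critic_loss.backward()
    critic_opt.step()

    # Delayed update: touch the actor and the targets only every policy_delay steps
    if step % policy_delay == 0:
        actor_loss = -critic1(state, actor(state)).mean()
        actor_opt.zero_grad()
        actor_loss.backward()
        actor_opt.step()
        # Polyak averaging of all three target networks
        with torch.no_grad():
            for target, source in ((actor_target, actor),
                                   (critic1_target, critic1),
                                   (critic2_target, critic2)):
                for tp, sp in zip(target.parameters(), source.parameters()):
                    tp.data.mul_(1.0 - tau).add_(tau * sp.data)
```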

4. Implementation

The main highlights of the code are:
  1. The Network: The actor network and the two critic networks are defined separately, as shown in the flowchart (a sketch of the network definitions follows at the end of this section).

  2. Play: Collect at least 10,000 transitions and store them in the replay buffer, then sample batches of 64 for training.
  3. Critic Training: Calculate the critic loss and perform the backpropagation.

  4. Actor Training: Calculate the actor loss and perform the backpropagation.

  5. Soft-Update: Sync the target networks by performing Polyak averaging, as in DDPG (sketched below).
The code can be found in this repository. I have tried my best to make it as clear as possible, so that the theory can be closely followed.
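For reference, here is a rough sketch of the actor, one of the twin critics, and the Polyak soft update. The layer sizes (two hidden layers of 256 units) and tau = 0.005 follow the TD3 paper; the class and function names are illustrative, not necessarily what the repository uses.

```python
import torch
import torch.nn as nn

class Actor(nn.Module):
    """Deterministic policy: maps a state to a bounded continuous action."""
    def __init__(self, obs_dim, act_dim, max_action=1.0):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(obs_dim, 256), nn.ReLU(),
            nn.Linear(256, 256), nn.ReLU(),
            nn.Linear(256, act_dim), nn.Tanh(),
        )
        self.max_action = max_action

    def forward(self, obs):
        return self.max_action * self.net(obs)

class Critic(nn.Module):
    """Q-network: maps a (state, action) pair to a scalar. TD3 uses two of these."""
    def __init__(self, obs_dim, act_dim):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(obs_dim + act_dim, 256), nn.ReLU(),
            nn.Linear(256, 256), nn.ReLU(),
            nn.Linear(256, 1),
        )

    def forward(self, obs, act):
        return self.net(torch.cat([obs, act], dim=-1))

def soft_update(target, source, tau=0.005):
    """Polyak averaging: theta_target <- tau * theta + (1 - tau) * theta_target."""
    with torch.no_grad():
        for tp, sp in zip(target.parameters(), source.parameters()):
            tp.data.mul_(1.0 - tau).add_(tau * sp.data)
```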
 
It takes around 60 minutes of training to reach a reward of 9350 on my GTX 1650 card. Check out the video of the agent performing.

5. Conclusion

Twin Delayed Deep Deterministic Policy Gradient (TD3) represents a significant advancement in deterministic continuous control RL. Its combination of twin critics, target policy smoothing, and delayed updates delivers reliable, stable, and high-performing policies on challenging benchmarks.

TD3 remains one of the best choices for continuous action spaces when stochastic exploration is not required, particularly in robotics, control, and simulation. Thank You and Stay Tuned.


FUTURE BEYOND OUR STAR!




