In this blog we will take a look at a relatively simple RL algorithm called the Cross-Entropy Method. It is easy to implement and shows good convergence in simple environments.
1. What is the Cross-Entropy Method (CEM)?
The core idea of CEM is to throw away bad episodes and train only on good (elite) episodes. It's a model-free, policy-based, on-policy method.
The name comes from the cross-entropy loss used to train the policy. By minimizing the cross-entropy between the current policy’s output and the actions taken in elite episodes, the method encourages the agent to reproduce high-reward behavior. The CE loss function is as follows:
$$\mathcal{L}(\theta) = -\frac{1}{N}\sum_{i=1}^{N} \log \pi_\theta(a_i \mid s_i)$$

where the $(s_i, a_i)$ pairs are the states visited and actions taken in the elite episodes, and $\pi_\theta$ is the probability distribution produced by the policy network.
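As a minimal sketch (assuming a PyTorch policy network that outputs raw logits, with hypothetical tensor shapes for CartPole), this loss can be computed with `nn.CrossEntropyLoss`, which applies log-softmax to the logits internally:

```python
import torch
import torch.nn as nn

# Hypothetical elite batch: states visited and actions taken in elite episodes.
# Shapes assume CartPole: 4 state variables, 2 discrete actions.
states = torch.randn(32, 4)
actions = torch.randint(0, 2, (32,))

# Two linear layers (as described later in the post); the final Softmax is omitted
# here because CrossEntropyLoss expects raw logits and applies log-softmax itself.
policy_net = nn.Sequential(nn.Linear(4, 128), nn.ReLU(), nn.Linear(128, 2))

loss_fn = nn.CrossEntropyLoss()
loss = loss_fn(policy_net(states), actions)  # = -mean log pi(a_i | s_i) over the batch
loss.backward()
```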
2. The Cartpole Environment
The CartPole-v1 environment is a classic control problem widely used in reinforcement learning research and tutorials. The objective is to balance a pole on a moving cart by applying left or right forces.
The state space consists of four continuous variables: cart position, cart velocity, pole angle, and pole angular velocity. The action space is discrete with two actions: push the cart to the left or to the right.
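For reference, here is a quick sketch (assuming the `gymnasium` package; the classic `gym` API is nearly identical) of what the environment and its spaces look like:

```python
import gymnasium as gym

env = gym.make("CartPole-v1")
obs, info = env.reset(seed=0)

print(env.observation_space)  # Box of 4 floats: position, velocity, angle, angular velocity
print(env.action_space)       # Discrete(2): 0 = push left, 1 = push right
print(obs.shape)              # (4,)

# Each step returns a reward of +1 for as long as the pole stays upright.
obs, reward, terminated, truncated, info = env.step(env.action_space.sample())
env.close()
```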
- The Network: A simple network with two linear layers has been used. The output of the last layer is passed through a Softmax to get the probability distribution over actions.
- Play N Episodes: Play around 50 episodes and store them in the experience replay buffer, as explained in the previous blog on DQN.
- Training: Discard episodes whose total reward falls below a chosen percentile (e.g. the 70th) and train only on the remaining elite episodes. Continue this cycle of playing and training until the specified number of epochs is reached (50 epochs in our case). The full loop is sketched below.
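Below is a minimal end-to-end sketch of this loop, assuming PyTorch and `gymnasium`; the 70th-percentile cutoff, hidden size of 128, and learning rate are illustrative choices, not necessarily the ones used in the original code:

```python
import gymnasium as gym
import numpy as np
import torch
import torch.nn as nn

HIDDEN = 128
BATCH_EPISODES = 50   # episodes played per epoch
PERCENTILE = 70       # reward cutoff for "elite" episodes (a common choice)
EPOCHS = 50

env = gym.make("CartPole-v1")
net = nn.Sequential(nn.Linear(4, HIDDEN), nn.ReLU(), nn.Linear(HIDDEN, 2))
loss_fn = nn.CrossEntropyLoss()   # applied to raw logits
optimizer = torch.optim.Adam(net.parameters(), lr=0.01)
softmax = nn.Softmax(dim=1)

def play_episode(env, net):
    """Run one episode, sampling actions from the softmax policy."""
    states, actions, total_reward = [], [], 0.0
    obs, _ = env.reset()
    while True:
        probs = softmax(net(torch.as_tensor(obs, dtype=torch.float32).unsqueeze(0)))
        action = int(np.random.choice(2, p=probs.detach().numpy()[0]))
        next_obs, reward, terminated, truncated, _ = env.step(action)
        states.append(obs); actions.append(action); total_reward += reward
        obs = next_obs
        if terminated or truncated:
            return states, actions, total_reward

for epoch in range(EPOCHS):
    episodes = [play_episode(env, net) for _ in range(BATCH_EPISODES)]
    rewards = [r for _, _, r in episodes]
    cutoff = np.percentile(rewards, PERCENTILE)

    # Keep only elite episodes (total reward at or above the cutoff).
    elite_states, elite_actions = [], []
    for states, actions, r in episodes:
        if r >= cutoff:
            elite_states.extend(states)
            elite_actions.extend(actions)

    # Train the policy to imitate the actions taken in elite episodes.
    optimizer.zero_grad()
    logits = net(torch.as_tensor(np.array(elite_states), dtype=torch.float32))
    loss = loss_fn(logits, torch.as_tensor(elite_actions, dtype=torch.long))
    loss.backward()
    optimizer.step()

    print(f"epoch {epoch}: mean reward {np.mean(rewards):.1f}, loss {loss.item():.3f}")
```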
CEM in reinforcement learning is a powerful episodic method where learning is guided by selecting and imitating high-reward episodes.
This makes it a great baseline for discrete-action environments like CartPole, and a valuable teaching tool for understanding how reward-driven filtering can shape policy learning—without ever computing gradients through reward functions. Thank You and Stay Tuned.