In this blog we will take a look at a relatively simple RL algorithm called the Cross-Entropy Method. It is easy to implement and shows good convergence in simple environments.
1. What is the Cross-Entropy Method (CEM)?
The core idea of CEM is to throw away bad episodes and train only on good (elite) episodes. It's a model-free, policy-based, on-policy method.
The name comes from the cross-entropy loss used to train the policy. By minimizing the cross-entropy between the current policy’s output and the actions taken in elite episodes, the method encourages the agent to reproduce high-reward behavior. The CE loss function is as follows:
$$\mathcal{L}(\theta) = -\frac{1}{N}\sum_{i=1}^{N} \log \pi_\theta(a_i \mid s_i)$$

where the $(s_i, a_i)$ pairs are the states visited and actions taken in the elite episodes, and $\pi_\theta$ is the probability distribution produced by the policy network.
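As a minimal sketch (assuming a PyTorch policy network that outputs raw logits, with hypothetical tensor shapes for CartPole), this loss can be computed with `nn.CrossEntropyLoss`, which applies log-softmax to the logits internally:

```python
import torch
import torch.nn as nn

# Hypothetical elite batch: states visited and actions taken in elite episodes.
# Shapes assume CartPole: 4 state variables, 2 discrete actions.
states = torch.randn(32, 4)
actions = torch.randint(0, 2, (32,))

# Two linear layers (as described later in the post); the final Softmax is omitted
# here because CrossEntropyLoss expects raw logits and applies log-softmax itself.
policy_net = nn.Sequential(nn.Linear(4, 128), nn.ReLU(), nn.Linear(128, 2))

loss_fn = nn.CrossEntropyLoss()
loss = loss_fn(policy_net(states), actions)  # = -mean log pi(a_i | s_i) over the batch
loss.backward()
```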
2. The Cartpole Environment
The CartPole-v1 environment is a classic control problem widely used in reinforcement learning research and tutorials. The objective is to balance a pole on a moving cart by applying left or right forces.
The state space consists of four continuous variables: cart position, cart velocity, pole angle, and pole angular velocity. The action space is discrete with two actions: push the cart to the left or to the right.
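For reference, here is a quick sketch (assuming the `gymnasium` package; the classic `gym` API is nearly identical) of what the environment and its spaces look like:

```python
import gymnasium as gym

env = gym.make("CartPole-v1")
obs, info = env.reset(seed=0)

print(env.observation_space)  # Box of 4 floats: position, velocity, angle, angular velocity
print(env.action_space)       # Discrete(2): 0 = push left, 1 = push right
print(obs.shape)              # (4,)

# Each step returns a reward of +1 for as long as the pole stays upright.
obs, reward, terminated, truncated, info = env.step(env.action_space.sample())
env.close()
```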
- The Network: A simple network with two linear layers has been used. The output of the last layer is passed through a Softmax to get the probability distribution over actions.
- Play N Episodes: Play around 50 episodes and store them in the experience replay buffer, as explained in the previous blog on DQN.
- Training: Discard episodes whose total reward falls below a chosen percentile (e.g. the 70th) and train only on the remaining elite episodes. Continue this cycle of playing and training until the specified number of epochs is reached (50 epochs in our case). The full loop is sketched below.
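Below is a minimal end-to-end sketch of this loop, assuming PyTorch and `gymnasium`; the 70th-percentile cutoff, hidden size of 128, and learning rate are illustrative choices, not necessarily the ones used in the original code:

```python
import gymnasium as gym
import numpy as np
import torch
import torch.nn as nn

HIDDEN = 128
BATCH_EPISODES = 50   # episodes played per epoch
PERCENTILE = 70       # reward cutoff for "elite" episodes (a common choice)
EPOCHS = 50

env = gym.make("CartPole-v1")
net = nn.Sequential(nn.Linear(4, HIDDEN), nn.ReLU(), nn.Linear(HIDDEN, 2))
loss_fn = nn.CrossEntropyLoss()   # applied to raw logits
optimizer = torch.optim.Adam(net.parameters(), lr=0.01)
softmax = nn.Softmax(dim=1)

def play_episode(env, net):
    """Run one episode, sampling actions from the softmax policy."""
    states, actions, total_reward = [], [], 0.0
    obs, _ = env.reset()
    while True:
        probs = softmax(net(torch.as_tensor(obs, dtype=torch.float32).unsqueeze(0)))
        action = int(np.random.choice(2, p=probs.detach().numpy()[0]))
        next_obs, reward, terminated, truncated, _ = env.step(action)
        states.append(obs); actions.append(action); total_reward += reward
        obs = next_obs
        if terminated or truncated:
            return states, actions, total_reward

for epoch in range(EPOCHS):
    episodes = [play_episode(env, net) for _ in range(BATCH_EPISODES)]
    rewards = [r for _, _, r in episodes]
    cutoff = np.percentile(rewards, PERCENTILE)

    # Keep only elite episodes (total reward at or above the cutoff).
    elite_states, elite_actions = [], []
    for states, actions, r in episodes:
        if r >= cutoff:
            elite_states.extend(states)
            elite_actions.extend(actions)

    # Train the policy to imitate the actions taken in elite episodes.
    optimizer.zero_grad()
    logits = net(torch.as_tensor(np.array(elite_states), dtype=torch.float32))
    loss = loss_fn(logits, torch.as_tensor(elite_actions, dtype=torch.long))
    loss.backward()
    optimizer.step()

    print(f"epoch {epoch}: mean reward {np.mean(rewards):.1f}, loss {loss.item():.3f}")
```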
CEM in reinforcement learning is a powerful episodic method where learning is guided by selecting and imitating high-reward episodes.
This makes it a great baseline for discrete-action environments like CartPole, and a valuable teaching tool for understanding how reward-driven filtering can shape policy learning—without ever computing gradients through reward functions. Thank You and Stay Tuned.