i. Agent: The learner or decision-maker. It could be a robot, a game-playing bot, or an algorithm trading stocks.
ii. Environment: The world the agent interacts with. It provides feedback in the form of rewards based on the agent’s actions.
iii. State (s): A representation of the current situation of the environment.
iv. Action (a): A choice the agent can make in a given state.
v. Reward (r): A numerical value received after taking an action. It reflects the desirability of the outcome.
vi. Policy (π): A strategy the agent follows to decide which action to take in a given state.
The learning process follows a feedback loop:
- The agent observes the current state.
- It selects an action based on its policy.
- The environment responds by providing a reward and a new state.
- The agent updates its policy using this feedback.
Over time, the agent learns which actions yield the highest rewards, not instantly, but by balancing exploration and exploitation.
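To make this loop concrete, here is a minimal sketch in Python. It assumes the gymnasium package and its CartPole-v1 environment are installed, and a random action stands in for a learned policy; every algorithm discussed below fits into this skeleton and differs mainly in how the update step is done.

```python
import gymnasium as gym

env = gym.make("CartPole-v1")
state, _ = env.reset(seed=0)

for step in range(200):
    # A learned policy would map `state` to an action; a random action stands in here.
    action = env.action_space.sample()
    next_state, reward, terminated, truncated, _ = env.step(action)
    # A real agent would update its policy here using (state, action, reward, next_state).
    state = next_state
    if terminated or truncated:
        state, _ = env.reset()

env.close()
```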
4. Model-Free vs Model-Based Methods
Model-free methods learn optimal policies or value functions directly from interactions with the environment, without attempting to model the environment's dynamics. This makes them simpler and often more robust in complex or unknown systems, but they typically require a large number of samples to learn effectively.
On the other hand, model-based methods involve learning or using a model of the environment—predicting the next state and reward given a current state and action—which enables the agent to plan and simulate outcomes before acting.
While model-based RL is generally more sample-efficient and capable of strategic foresight, it can suffer from inaccuracies in the learned model, which may lead to poor decisions.
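As a rough illustration of the model-based idea, the sketch below runs value iteration entirely inside a toy, randomly generated "learned" model of a small discrete MDP. Every name and number here is an assumption made for illustration, not part of any specific algorithm.

```python
import numpy as np

# Toy "learned" model of a small discrete MDP; all sizes and values are illustrative.
rng = np.random.default_rng(0)
n_states, n_actions, gamma = 5, 2, 0.9
model_T = rng.dirichlet(np.ones(n_states), size=(n_states, n_actions))  # predicted P(s' | s, a)
model_R = rng.random((n_states, n_actions))                             # predicted reward r(s, a)

# Planning step: value iteration runs inside the model,
# without collecting any further environment samples.
V = np.zeros(n_states)
for _ in range(100):
    Q = model_R + gamma * model_T @ V   # shape (n_states, n_actions)
    V = Q.max(axis=1)
policy = Q.argmax(axis=1)               # greedy policy derived purely by planning
```

If the learned model_T and model_R are inaccurate, the planned policy inherits those errors, which is exactly the weakness noted above.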
5. Value-Based Methods vs Policy-Based Methods
Value-based methods focus on learning a value function, such as the state-value function V(s) or the action-value function Q(s, a), which estimates how good it is to be in a certain state or to take a specific action in that state. The policy is then derived indirectly by selecting actions that maximize the estimated value. Algorithms like Q-Learning and Deep Q-Networks (DQN) fall into this category.
In contrast, policy-based methods learn the policy directly, optimizing it to maximize the expected cumulative reward without relying on a value function. These methods are particularly useful when dealing with continuous action spaces or stochastic policies and include algorithms like REINFORCE and Proximal Policy Optimization (PPO).
While value-based methods are often more sample-efficient, policy-based methods can provide smoother convergence and better performance in high-dimensional or complex environments.
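The contrast is in where the policy lives. In the hypothetical sketch below, the value-based agent keeps a Q-table and acts greedily on it, while the policy-based agent keeps policy parameters and samples actions from a softmax over them; all sizes and names are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
n_states, n_actions = 4, 2               # illustrative sizes

# Value-based view: the policy is implicit in a learned action-value table Q[s, a].
Q = rng.random((n_states, n_actions))    # stand-in for learned value estimates
value_based_action = int(Q[0].argmax())  # act greedily with respect to the values

# Policy-based view: the policy itself is parameterized and sampled from directly.
theta = rng.normal(size=(n_states, n_actions))  # stand-in for learned policy parameters

def softmax_policy(state: int) -> int:
    logits = theta[state]
    probs = np.exp(logits - logits.max())
    probs /= probs.sum()
    return int(rng.choice(n_actions, p=probs))

policy_based_action = softmax_policy(0)
```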
6. On-Policy vs Off-Policy Methods
On-policy methods learn from data collected using the current policy being improved. That means the agent learns from actions it would actually take in the environment. This leads to stable but potentially slower learning since it constantly updates based on its evolving behavior. Examples include Cross-Entropy Method and Proximal Policy Optimization (PPO).
In contrast, off-policy methods learn from experiences generated by a different policy than the one currently being optimized—often a more exploratory one. This allows for more efficient reuse of past experiences and parallel learning from old or external data. Popular off-policy algorithms include Q-Learning, Deep Q-Networks (DQN), and Twin Delayed DDPG (TD3).
Off-policy methods are generally more data-efficient, but can be more prone to instability, especially when combined with function approximation like deep neural networks.
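One way to see the distinction is in the update targets of tabular SARSA (a classic on-policy method, not listed above) and Q-Learning. The sketch below uses illustrative sizes and hyperparameters and isolates the single term that differs.

```python
import numpy as np

# Toy Q-table updates; sizes and hyperparameters are illustrative.
n_states, n_actions = 6, 3
alpha, gamma = 0.1, 0.99
Q = np.zeros((n_states, n_actions))

def sarsa_update(s, a, r, s_next, a_next):
    """On-policy: the target uses the action the current policy actually takes next."""
    Q[s, a] += alpha * (r + gamma * Q[s_next, a_next] - Q[s, a])

def q_learning_update(s, a, r, s_next):
    """Off-policy: the target uses the greedy action, whatever the behavior policy did."""
    Q[s, a] += alpha * (r + gamma * Q[s_next].max() - Q[s, a])
```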
7. Where do we use RL?
- Robotics: Teaching robots to walk, manipulate objects, or navigate environments.
- Gaming: RL agents mastering games like Go, Chess, StarCraft, and Dota.
- Autonomous Driving: Making split-second decisions in dynamic environments.
- Finance: Dynamic portfolio management and high-frequency trading strategies.
- Healthcare: Optimizing treatment strategies for patients over time.
8. Popular RL Algorithms
- Q-Learning
- Deep Q-Networks (DQN)
- Cross-Entropy Method
- REINFORCE Method
- Advantage Actor-Critic (A2C)
- Deep Deterministic Policy Gradient (DDPG)
- Twin Delayed Deep Deterministic Policy Gradient (TD3)
- Soft Actor-Critic (SAC)
- Proximal Policy Optimization (PPO)