i. Agent: The learner or decision-maker. It could be a robot, a game-playing bot, or an algorithm trading stocks.
ii. Environment: The world the agent interacts with. It provides feedback in the form of rewards based on the agent’s actions.
iii. State (s): A representation of the current situation of the environment.
iv. Action (a): A choice the agent can make in a given state.
v. Reward (r): A numerical value received after taking an action. It reflects the desirability of the outcome.
vi. Policy (π): A strategy the agent follows to decide which action to take in a given state.
The learning process follows a feedback loop:
- The agent observes the current state.
- It selects an action based on its policy.
- The environment responds by providing a reward and a new state.
- The agent updates its policy using this feedback.
Over time, the agent learns which actions yield the highest rewards, not instantly, but by balancing exploration and exploitation.
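To make this loop concrete, here is a minimal sketch in Python. It assumes the gymnasium package and its CartPole-v1 environment are installed, and a random action stands in for a learned policy; every algorithm discussed below fits into this skeleton and differs mainly in how the update step is done.

```python
import gymnasium as gym

env = gym.make("CartPole-v1")
state, _ = env.reset(seed=0)

for step in range(200):
    # A learned policy would map `state` to an action; a random action stands in here.
    action = env.action_space.sample()
    next_state, reward, terminated, truncated, _ = env.step(action)
    # A real agent would update its policy here using (state, action, reward, next_state).
    state = next_state
    if terminated or truncated:
        state, _ = env.reset()

env.close()
```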
4. Model-Free vs Model-Based Methods
Model-free methods learn optimal policies or value functions directly from interactions with the environment, without attempting to model the environment's dynamics. This makes them simpler and often more robust in complex or unknown systems, but they typically require a large number of samples to learn effectively.
On the other hand, model-based methods involve learning or using a model of the environment—predicting the next state and reward given a current state and action—which enables the agent to plan and simulate outcomes before acting.
While model-based RL is generally more sample-efficient and capable of strategic foresight, it can suffer from inaccuracies in the learned model, which may lead to poor decisions.
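As a rough illustration of the model-based idea, the sketch below runs value iteration entirely inside a toy, randomly generated "learned" model of a small discrete MDP. Every name and number here is an assumption made for illustration, not part of any specific algorithm.

```python
import numpy as np

# Toy "learned" model of a small discrete MDP; all sizes and values are illustrative.
rng = np.random.default_rng(0)
n_states, n_actions, gamma = 5, 2, 0.9
model_T = rng.dirichlet(np.ones(n_states), size=(n_states, n_actions))  # predicted P(s' | s, a)
model_R = rng.random((n_states, n_actions))                             # predicted reward r(s, a)

# Planning step: value iteration runs inside the model,
# without collecting any further environment samples.
V = np.zeros(n_states)
for _ in range(100):
    Q = model_R + gamma * model_T @ V   # shape (n_states, n_actions)
    V = Q.max(axis=1)
policy = Q.argmax(axis=1)               # greedy policy derived purely by planning
```

If the learned model_T and model_R are inaccurate, the planned policy inherits those errors, which is exactly the weakness noted above.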
5. Value-Based Methods vs Policy-Based Methods
Value-based methods focus on learning a value function, such as the state-value function V(s) or the action-value function Q(s, a), which estimates how good it is to be in a certain state or to take a specific action in that state. The policy is then derived indirectly by selecting actions that maximize the estimated value. Algorithms like Q-Learning and Deep Q-Networks (DQN) fall into this category.
In contrast, policy-based methods learn the policy directly, optimizing it to maximize the expected cumulative reward without relying on a value function. These methods are particularly useful when dealing with continuous action spaces or stochastic policies and include algorithms like REINFORCE and Proximal Policy Optimization (PPO).
While value-based methods are often more sample-efficient, policy-based methods can provide smoother convergence and better performance in high-dimensional or complex environments.
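The contrast is in where the policy lives. In the hypothetical sketch below, the value-based agent keeps a Q-table and acts greedily on it, while the policy-based agent keeps policy parameters and samples actions from a softmax over them; all sizes and names are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
n_states, n_actions = 4, 2               # illustrative sizes

# Value-based view: the policy is implicit in a learned action-value table Q[s, a].
Q = rng.random((n_states, n_actions))    # stand-in for learned value estimates
value_based_action = int(Q[0].argmax())  # act greedily with respect to the values

# Policy-based view: the policy itself is parameterized and sampled from directly.
theta = rng.normal(size=(n_states, n_actions))  # stand-in for learned policy parameters

def softmax_policy(state: int) -> int:
    logits = theta[state]
    probs = np.exp(logits - logits.max())
    probs /= probs.sum()
    return int(rng.choice(n_actions, p=probs))

policy_based_action = softmax_policy(0)
```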
6. On-Policy vs Off-Policy Methods
On-policy methods learn from data collected using the current policy being improved. That means the agent learns from actions it would actually take in the environment. This leads to stable but potentially slower learning since it constantly updates based on its evolving behavior. Examples include Cross-Entropy Method and Proximal Policy Optimization (PPO).
In contrast, off-policy methods learn from experiences generated by a different policy than the one currently being optimized—often a more exploratory one. This allows for more efficient reuse of past experiences and parallel learning from old or external data. Popular off-policy algorithms include Q-Learning, Deep Q-Networks (DQN), and Twin Delayed DDPG (TD3).
Off-policy methods are generally more data-efficient, but can be more prone to instability, especially when combined with function approximation like deep neural networks.
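One way to see the distinction is in the update targets of tabular SARSA (a classic on-policy method, not listed above) and Q-Learning. The sketch below uses illustrative sizes and hyperparameters and isolates the single term that differs.

```python
import numpy as np

# Toy Q-table updates; sizes and hyperparameters are illustrative.
n_states, n_actions = 6, 3
alpha, gamma = 0.1, 0.99
Q = np.zeros((n_states, n_actions))

def sarsa_update(s, a, r, s_next, a_next):
    """On-policy: the target uses the action the current policy actually takes next."""
    Q[s, a] += alpha * (r + gamma * Q[s_next, a_next] - Q[s, a])

def q_learning_update(s, a, r, s_next):
    """Off-policy: the target uses the greedy action, whatever the behavior policy did."""
    Q[s, a] += alpha * (r + gamma * Q[s_next].max() - Q[s, a])
```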
7. Where do we use RL?
- Robotics: Teaching robots to walk, manipulate objects, or navigate environments.
- Gaming: RL agents mastering games like Go, Chess, StarCraft, and Dota.
- Autonomous Driving: Making split-second decisions in dynamic environments.
- Finance: Dynamic portfolio management and high-frequency trading strategies.
- Healthcare: Optimizing treatment strategies for patients over time.
8. Popular RL Algorithms
- Q-Learning
- Deep Q-Networks (DQN)
- Cross-Entropy Method
- REINFORCE Method
- Advantage Actor-Critic (A2C)
- Deep Deterministic Policy Gradient (DDPG)
- Twin Delayed Deep Deterministic Policy Gradient (TD3)
- Soft Actor-Critic (SAC)
- Proximal Policy Optimization (PPO)