Imagine teaching a robot to cross a room without bumping into furniture. You could give it a set of instructions, but what if the room changes every time? That’s where Reinforcement Learning (RL) comes in—specifically, Q-Learning, one of the most fundamental and powerful algorithms in RL.
In this blog post, we’ll explore what Q-Learning is, how it works, where it’s used, and how you can implement it from scratch for the Frozen Lake environment from OpenAI Gym.
1. What is Q-Learning?
Q-Learning is a model-free, value-based reinforcement learning algorithm. It helps an agent learn how to act optimally in a given environment by learning the value (or quality) of actions taken in specific states.
The goal is to learn the Q-function:
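In standard notation, Q(s, a) is the expected discounted return from taking action a in state s and acting optimally afterwards:

$$Q^*(s, a) = \mathbb{E}\left[\, r_{t+1} + \gamma\, r_{t+2} + \gamma^2\, r_{t+3} + \dots \;\middle|\; s_t = s,\ a_t = a \,\right]$$

where γ is the discount factor that weighs future rewards against immediate ones.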
Over time, the agent updates its Q-values by interacting with the environment, receiving rewards, and adjusting its future behavior accordingly.
2. Bellman Optimality Equation
The Bellman optimality principle states that the optimal value of a state can be computed from the immediate reward plus the discounted optimal value of the next state.
i) Value for Action (Q-Value)
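In standard notation, with discount factor γ, the Bellman optimality equation for the action value is:

$$Q^*(s, a) = \mathbb{E}_{s'}\!\left[\, r(s, a) + \gamma \max_{a'} Q^*(s', a') \,\right]$$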
Here:
- The Q-value Q(s, a) is the maximum expected return the agent can obtain by taking action a in state s and then acting optimally.
- The state value V(s) is the maximum of Q(s, a) over all actions available in state s; the action that achieves this maximum is the one the agent should take to get the highest reward.
3. How Q-Learning Works
The agent starts with a Q-table initialized to zeros:
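For FrozenLake’s 16 states and 4 actions, a zero-initialized table might be created like this (a minimal sketch assuming NumPy; the variable names are illustrative):

```python
import numpy as np

n_states, n_actions = 16, 4                # 4x4 FrozenLake grid, 4 possible moves
q_table = np.zeros((n_states, n_actions))  # every Q(s, a) starts at zero
```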
As it explores the environment using ε-greedy policy, the agent updates its Q-values using:
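In standard notation, with learning rate α and discount factor γ, the update after observing reward r and next state s' is:

$$Q(s, a) \leftarrow Q(s, a) + \alpha \left[\, r + \gamma \max_{a'} Q(s', a') - Q(s, a) \,\right]$$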
4. Exploration vs Exploitation
Q-Learning uses an ε-greedy strategy:
- With probability ε, the agent explores (takes a random action).
- With probability 1 − ε, it exploits (chooses the best known action).
This ensures the agent doesn’t get stuck in a suboptimal strategy early on.
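As a rough sketch, ε-greedy action selection over a NumPy Q-table could look like the following (the function name and the example values of epsilon are illustrative, not taken from the original code):

```python
import random
import numpy as np

def epsilon_greedy_action(q_table, state, epsilon):
    """Pick a random action with probability epsilon, otherwise the greedy one."""
    n_actions = q_table.shape[1]
    if random.random() < epsilon:
        return random.randrange(n_actions)   # explore: uniform random action
    return int(np.argmax(q_table[state]))    # exploit: best known action

# Example usage with the 16x4 FrozenLake table from above
# (epsilon is typically decayed over the course of training):
q_table = np.zeros((16, 4))
action = epsilon_greedy_action(q_table, state=0, epsilon=0.1)
```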
5. The Frozen Lake Environment
The FrozenLake environment is a classic benchmark from OpenAI Gym used to test reinforcement learning algorithms in a discrete grid-world setting. It represents a frozen lake where an agent must learn to navigate from a start point to a goal without falling into holes. The environment is structured as a grid of tiles that can be:
- S: Start
- F: Frozen (safe)
- H: Hole (game over if entered)
- G: Goal
The challenge comes from the slippery surface, which introduces stochasticity—the agent may slip and end up in a different tile than intended. This forces the agent to learn a policy that balances risk and reward, making it ideal for studying exploration, stochastic transitions, and reward sparsity in reinforcement learning.
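For reference, this is roughly how the environment is created and stepped through, assuming the newer Gym/Gymnasium API (version 0.26 or later, where step() returns separate terminated and truncated flags):

```python
import gym  # recent releases are published as "gymnasium"

# is_slippery=True enables the stochastic transitions described above.
env = gym.make("FrozenLake-v1", is_slippery=True)

state, info = env.reset()
action = env.action_space.sample()   # 0=Left, 1=Down, 2=Right, 3=Up
next_state, reward, terminated, truncated, info = env.step(action)
done = terminated or truncated
```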
6. Implementation
The main highlights of the code are:
- Q-table: Initialized to zeros.
- Transition table: Tracks how often a state transitions to another.
- Reward dictionary: Maps (state, action, next_state) to reward values.
- play_n_random_episodes(n): Random exploration to populate experience.
- value_iteration(): Updates Q-values based on observed transitions.
- play_episode(): Runs one episode using the current Q-table greedily.
- Stopping condition: Training stops when the agent achieves an average reward above 0.8 across test episodes.
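A minimal sketch of how these pieces could fit together is shown below. It is not the exact code from this post: the class and method names simply mirror the highlights above, the Gym/Gymnasium ≥ 0.26 step API is assumed, and GAMMA, the number of random episodes per round, and the number of test episodes are illustrative values.

```python
import collections
import gym  # recent releases are published as "gymnasium"

GAMMA = 0.9          # discount factor (illustrative value)
TEST_EPISODES = 20   # episodes used to measure the average reward

class Agent:
    def __init__(self, env_name="FrozenLake-v1"):
        self.env = gym.make(env_name)
        self.state, _ = self.env.reset()
        # Reward dictionary: (state, action, next_state) -> last observed reward
        self.rewards = collections.defaultdict(float)
        # Transition table: (state, action) -> Counter of observed next states
        self.transits = collections.defaultdict(collections.Counter)
        # Q-table: (state, action) -> Q-value, implicitly zero at the start
        self.q_values = collections.defaultdict(float)

    def play_n_random_episodes(self, n):
        """Random exploration to populate the transition and reward tables."""
        for _ in range(n):
            done = False
            while not done:
                action = self.env.action_space.sample()
                new_state, reward, terminated, truncated, _ = self.env.step(action)
                done = terminated or truncated
                self.rewards[(self.state, action, new_state)] = reward
                self.transits[(self.state, action)][new_state] += 1
                self.state = new_state
            self.state, _ = self.env.reset()

    def value_iteration(self):
        """Refresh every Q-value from the observed transition statistics."""
        for state in range(self.env.observation_space.n):
            for action in range(self.env.action_space.n):
                counts = self.transits[(state, action)]
                total = sum(counts.values())
                if total == 0:
                    continue  # this (state, action) pair has never been seen
                q_value = 0.0
                for next_state, count in counts.items():
                    reward = self.rewards[(state, action, next_state)]
                    best_next = max(self.q_values[(next_state, a)]
                                    for a in range(self.env.action_space.n))
                    q_value += (count / total) * (reward + GAMMA * best_next)
                self.q_values[(state, action)] = q_value

    def select_action(self, state):
        """Greedy action with respect to the current Q-table."""
        return max(range(self.env.action_space.n),
                   key=lambda a: self.q_values[(state, a)])

    def play_episode(self, env):
        """Run one episode greedily and return its total reward."""
        total_reward, done = 0.0, False
        state, _ = env.reset()
        while not done:
            state, reward, terminated, truncated, _ = env.step(self.select_action(state))
            done = terminated or truncated
            total_reward += reward
        return total_reward

if __name__ == "__main__":
    agent = Agent()
    test_env = gym.make("FrozenLake-v1")
    iteration = 0
    while True:
        iteration += 1
        agent.play_n_random_episodes(100)   # gather fresh random experience
        agent.value_iteration()             # update Q-values from it
        avg_reward = sum(agent.play_episode(test_env)
                         for _ in range(TEST_EPISODES)) / TEST_EPISODES
        if avg_reward > 0.8:                # stopping condition described above
            print(f"Solved after {iteration} iterations, avg reward {avg_reward:.2f}")
            break
```

Because the Q-values here are recomputed from observed transition frequencies rather than nudged with a learning rate, this variant amounts to value iteration over the Q-table; the incremental update rule from Section 3, combined with an ε-greedy behaviour policy, would work just as well.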