The Lab

Introduction to Reinforcement Learning and the Bellman Equations

In reinforcement learning, the idea is to find a policy π that maximises a reward if followed. A policy here is just a model that is given the current state of the environment and picks the next action.

Each state in an environment has an associated value for how good that state is. Sometimes you're happy and have energy to write blog posts. Other times you're unhappy and don't have the energy to write blog posts.

There are many such states which each carry a different “value” in terms of how much I could be rewarded for being in that state.

This is where we define our reward. A reward is a scalar number given for reaching some terminal state, which marks the end of a trajectory of actions. The terminal state here might be something like the end of the day or a race to see who can publish something first. We get to decide what the reward is at the end of a process.

In reinforcement learning, we call a trajectory of events leading up to a terminal state an episode, and we can say that we will do best if we can take the set of actions that maximise the reward we achieve at the end of an episode.

Given this setup, how then can we pick a policy which will help ensure that we get the maximum reward if we follow it?

Markov Decision Processes

A Markov Process is any process in which each state transition depends only on the immediately prior state and on no state before it. The chance of moving from one state to another is described by a transition probability (often presented in a matrix).
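
As a toy illustration (the states and probabilities here are entirely made up), a transition matrix and a single Markovian state transition might look like this:

```python
import numpy as np

# Hypothetical two-state Markov process: "energised" and "tired".
# Each row holds the transition probabilities out of one state.
states = ["energised", "tired"]
P = np.array([
    [0.7, 0.3],   # from "energised": 70% stay energised, 30% become tired
    [0.4, 0.6],   # from "tired":     40% recover,        60% stay tired
])

# Sample the next state given only the current one -- the next state
# depends on nothing further back, which is the Markov property.
rng = np.random.default_rng(0)
current = 0  # start "energised"
next_state = rng.choice(len(states), p=P[current])
print(states[next_state])
```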

A Markov Decision Process is an extension of a Markov Process to also account for actions and rewards.

In an MDP, we define the discounted return, $G_t$, which measures the future benefit of being on a trajectory by summing the discounted rewards collected over the steps taken from time-step $t$ onwards.

$$G_t = \sum_{k=0}^{\infty} \gamma^k \, r_{t+k+1}$$

The discount factor γ is usually a value between 0.9 and 0.99, which makes rewards that are closer in time more valuable. Delayed gratification bad!

The Markov decision process thus defines the total reward of a trajectory as the sum of all the discounted rewards collected over its course.
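
To make the discounted return concrete, here is a minimal sketch that computes $G_t$ for a finite list of rewards (the reward values and γ are made up):

```python
def discounted_return(rewards, gamma=0.99):
    """Compute G_t = sum_k gamma^k * r_{t+k+1} for a finite reward sequence."""
    g = 0.0
    # Iterate backwards so each reward is discounted by its distance from t.
    for r in reversed(rewards):
        g = r + gamma * g
    return g

# Rewards received after time-step t (made-up numbers).
print(discounted_return([0.0, 0.0, 1.0], gamma=0.9))  # 0.9^2 * 1.0 = 0.81
```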

All in all, an MDP goes from state to state with an agent (model) taking actions to generate the next state along with the reward for that state.

MDPs thus produce a sequence $(s_1, a_1, r_1), (s_2, a_2, r_2), \dots$, although in reality we never really see the full state after an action and instead just receive a partial observation $o_t$. Thus our sequence in practice is usually $s_1, o_1, a_1, r_2, s_2, o_2, a_2, r_3, \dots$
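
As a rough sketch of how such a sequence gets generated in code, here is a hypothetical agent-environment loop; the ToyEnv class is invented for illustration and stands in for whatever environment you actually have:

```python
import random

class ToyEnv:
    """Hypothetical environment: the true state is hidden; only a noisy observation is returned."""
    def __init__(self):
        self.state = 0

    def reset(self):
        self.state = 0
        return self._observe()

    def step(self, action):
        self.state += action                      # next state depends only on (state, action)
        reward = 1.0 if self.state >= 3 else 0.0  # reward for reaching the terminal state
        done = self.state >= 3                    # terminal state ends the episode
        return self._observe(), reward, done

    def _observe(self):
        return self.state + random.gauss(0, 0.1)  # partial/noisy observation o_t

env = ToyEnv()
obs = env.reset()
trajectory = []
for _ in range(100):                              # cap the episode length
    action = random.choice([0, 1])                # placeholder policy: act at random
    obs, reward, done = env.step(action)
    trajectory.append((obs, action, reward))      # the (o, a, r) sequence of the episode
    if done:
        break
```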

Policy and Expected Return

As we mentioned before, the policy π refers to a model that, given a state, outputs either the next action or a distribution over possible next actions. We find a good policy by estimating the expected return of being in one state over another and then defining a policy which tries to maximise this value.

Maximise this Value? Value? How do we measure the Value of being in a state? That will be the core discussion for this blog, but first, let’s clarify some more things about the policy.

The policy model in an RL environment today is usually the thing being trained, such as the neural network that picks the actions in a game environment, or an LLM being taught to respond in a way humans prefer by picking the next token.

RL algorithms that target the policy for training are called Policy-Gradient methods, and they are the precursor to many breakthroughs in modern AI such as RLHF and Test-Time Scaling.

The policy can be deterministic and output the specific next action, as in the case of some algorithms where we pick actions in a greedy way, or it can be stochastic and output a distribution over possible actions, as is the case with modern language models that sample from a distribution of possible next tokens.
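
Here is a minimal sketch of the difference, assuming we already have a vector of action preferences (logits) from some model; the numbers are made up:

```python
import numpy as np

rng = np.random.default_rng(0)
logits = np.array([2.0, 0.5, -1.0])        # made-up action preferences from a model

# Deterministic (greedy) policy: always take the highest-scoring action.
greedy_action = int(np.argmax(logits))

# Stochastic policy: turn the scores into a distribution and sample from it,
# much like a language model samples its next token.
probs = np.exp(logits) / np.exp(logits).sum()
sampled_action = int(rng.choice(len(logits), p=probs))
```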

Now let’s discuss how we can estimate the value of being in one state over another.

The Value Functions

Just use a neural network bro.

That’s all.

Just use a neural network lmao.

Okay but on a serious note, we basically want a way to estimate the discounted total reward from the current state at time-step $t$, $G_t$. We define the State Value Function as the expected value of the total return from the current state $s_t$ to the terminal state $s_T$ if the policy π is followed.

$$V(s_t \mid \pi) = \mathbb{E}\left[\, G_t \mid s_t, \pi \,\right]$$

This can be thought of as the average value that can be expected from being in a state, since it doesn’t account for the expected reward of picking any one specific action from that state. Informally, the state value tells us the long-term reward we can expect on average if we start in this state and follow the policy thereafter.
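
One way to make that expectation concrete is a simple Monte Carlo estimate: roll out many episodes from the same start state under the policy and average the discounted returns. A rough sketch, where `policy` and `step_fn` are hypothetical stand-ins for your actual policy and environment:

```python
def mc_state_value(start_state, policy, step_fn, gamma=0.99, n_episodes=1000, max_steps=100):
    """Estimate V(s | pi) by averaging discounted returns over sampled episodes.

    `policy(state) -> action` and `step_fn(state, action) -> (next_state, reward, done)`
    are placeholders for whatever policy and environment you actually have.
    """
    total = 0.0
    for _ in range(n_episodes):
        state, g, discount = start_state, 0.0, 1.0
        for _ in range(max_steps):
            action = policy(state)
            state, reward, done = step_fn(state, action)
            g += discount * reward            # accumulate the discounted return G_t
            discount *= gamma
            if done:
                break
        total += g
    return total / n_episodes                 # average return = estimate of the expectation
```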

In addition to this, we have another value function that we work with, called the Action Value Function or the Q Value Function.

$$Q(s_t, a_t \mid \pi) = \mathbb{E}\left[\, G_t \mid s_t, a_t, \pi \,\right]$$

This Q Value Function tells us the discounted total reward we would get if we took the action $a_t$ and followed the policy thereafter.

This Q Value Function is how all reinforcement learning algorithms connect future rewards such as the conclusion of a game to individual actions leading up to that reward.

The Q-Value Function lets us resolve the temporal credit assignment problem.

Optimal Policies and the Bellman Equations

Now that we have our State Value Function and our Q-Value Function, we can model the expected reward for different policies.

Note that while in the past we did use tabular methods to keep track of the value functions, today that is almost always intractable, so we go straight to neural networks that take the state as input and output a scalar value estimate.
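
For example, a state-value network can be as small as a couple of linear layers mapping a state vector to one scalar. A minimal PyTorch sketch (the layer sizes and state dimension are arbitrary):

```python
import torch
import torch.nn as nn

state_dim = 8                                  # arbitrary: size of the state vector

# Maps a state vector to a single scalar estimate of V(s).
value_net = nn.Sequential(
    nn.Linear(state_dim, 64),
    nn.ReLU(),
    nn.Linear(64, 1),
)

state = torch.randn(1, state_dim)              # a dummy state, batch size 1
value_estimate = value_net(state)              # shape (1, 1): the scalar V(s)
```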

Recall that we want a policy that maximises the expected return if followed. For any fully observable MDP, there is a proof that there always exists a deterministic, stationary policy that maximises the value of every state.

If we know this optimal policy, then we get the optimal state-value function

$$V^*(s_t) = \max_\pi \left[\, \mathbb{E}\left[ G_t \mid s_t, \pi \right] \,\right]$$

and the optimal Q-Value Function (called the iconic Q-Star function)

$$Q^*(s_t, a_t) = \max_\pi \left[\, \mathbb{E}\left[ G_t \mid s_t, a_t, \pi \right] \,\right]$$

Conversely, let’s consider a scenario where we have managed to find the optimal Q-Values for all possible actions. In that case, we can see which action has the highest expected return at each time step. We would then be able to derive the optimal policy implicitly, without a policy model, by simply always choosing the action with the highest return.

$$\pi^*(s_t) \leftarrow \arg\max_{a_t} \left[\, Q^*(s_t, a_t) \,\right]$$
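
With a tabular Q function (here just a made-up array of optimal Q-values), the implicit policy is a single argmax per state:

```python
import numpy as np

# Made-up table of optimal Q-values: rows are states, columns are actions.
q_star = np.array([
    [1.0, 2.5, 0.3],
    [0.2, 0.1, 4.0],
])

def implicit_policy(state):
    # The optimal action is simply the one with the highest Q-value in this state.
    return int(np.argmax(q_star[state]))

print(implicit_policy(0))  # -> 1
print(implicit_policy(1))  # -> 2
```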

Now, if you’ve been around the RL sphere at all before, you must have at some point heard the name Bellman. The Bellman equations are basically a way to connect the two value functions together and show their relationship. Let’s have a look.

Motivating the Bellman Equations

Recall that the State Value Function serves as a sort of average, over all actions, of the expected return of being in a state, conditioned on a policy.

Also recall that the Q Value Function gives the action-specific expected return from the current state, conditioned on a policy.

We can then posit that the value function is simply a weighted sum of all the action values, weighted by the probability of each action under the policy.

$$V(s_t) = \sum_{a_t} \pi(a_t \mid s_t) \cdot Q(s_t, a_t)$$

This is the average expected return of all actions from the current state.
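
In code, for a single state, this is just a dot product between the policy's action probabilities and the Q-values (both made up here):

```python
import numpy as np

pi = np.array([0.6, 0.3, 0.1])        # pi(a | s_t): the policy's action probabilities
q = np.array([1.0, 2.0, -0.5])        # Q(s_t, a) for each action (made-up values)

v = float(np.dot(pi, q))              # V(s_t) = sum_a pi(a|s_t) * Q(s_t, a)
print(v)                              # 0.6*1.0 + 0.3*2.0 + 0.1*(-0.5) = 1.15
```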

Similarly, we can suggest that the Q Value Function is the reward received for taking the specific action in the current state, plus the discounted average of the expected returns across all the possible next states after the action,

$$Q(s_t, a_t) = r(s_t, a_t) + \gamma \sum_{s_{t+1}} \Pr(s_{t+1} \mid s_t, a_t) \, V(s_{t+1})$$

where $\Pr(s_{t+1} \mid s_t, a_t)$ is the transition probability of moving to state $s_{t+1}$ from the current state $s_t$ after taking action $a_t$.
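
Again for a single (state, action) pair, with made-up immediate reward, transition probabilities, and next-state values:

```python
import numpy as np

gamma = 0.9
r = 1.0                                    # r(s_t, a_t): immediate reward (made up)
p_next = np.array([0.8, 0.2])              # Pr(s_{t+1} | s_t, a_t) over two next states
v_next = np.array([2.0, -1.0])             # V(s_{t+1}) for those next states (made up)

q = r + gamma * float(np.dot(p_next, v_next))   # Q(s_t, a_t) = r + gamma * E[V(s_{t+1})]
print(q)                                        # 1.0 + 0.9 * (1.6 - 0.2) = 2.26
```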

Fusing the two

Ok, and now with this formulation we have both equations expressed in the same terms, meaning we can substitute one equation into the other to derive the Bellman Equations.

Let’s start by substituting the Q Value Function equation into the Value Function equation to get a relationship between the State Value at the current time-step $t$ and the State Value at the following time-step $t+1$.

$$V(s_t) = \sum_{a_t} \pi(a_t \mid s_t) \cdot Q(s_t, a_t)$$
$$V(s_t) = \sum_{a_t} \pi(a_t \mid s_t) \cdot \left( r(s_t, a_t) + \gamma \sum_{s_{t+1}} \Pr(s_{t+1} \mid s_t, a_t) \, V(s_{t+1}) \right)$$

Similarly, we can substitute the State Value Function into the Q Value Function to get a relationship between the Action Value at the current time-step $t$ and at the subsequent time-step $t+1$.

$$Q(s_t, a_t) = r(s_t, a_t) + \gamma \sum_{s_{t+1}} \Pr(s_{t+1} \mid s_t, a_t) \, V(s_{t+1})$$
$$Q(s_t, a_t) = r(s_t, a_t) + \gamma \sum_{s_{t+1}} \Pr(s_{t+1} \mid s_t, a_t) \sum_{a_{t+1}} \pi(a_{t+1} \mid s_{t+1}) \cdot Q(s_{t+1}, a_{t+1})$$

Boom. With these two equations, we know that any change in the policy, in a State Value, or in an Action Value propagates to the other quantities, and vice versa.
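
To show the fused state-value equation doing real work, here is a sketch of iterative policy evaluation on a tiny made-up MDP: we repeatedly apply the Bellman expectation equation as an update rule until the values stop changing.

```python
import numpy as np

# A made-up MDP with 2 states and 2 actions.
# P[s, a, s'] = Pr(s' | s, a), R[s, a] = r(s, a).
P = np.array([
    [[0.9, 0.1], [0.2, 0.8]],
    [[0.5, 0.5], [0.0, 1.0]],
])
R = np.array([
    [0.0, 1.0],
    [0.5, 2.0],
])
pi = np.array([                       # pi(a | s): a fixed stochastic policy to evaluate
    [0.7, 0.3],
    [0.4, 0.6],
])
gamma = 0.9

V = np.zeros(2)
for _ in range(1000):
    # Bellman backup: V(s) = sum_a pi(a|s) * (r(s,a) + gamma * sum_s' Pr(s'|s,a) * V(s'))
    Q = R + gamma * P @ V             # Q[s, a] under the current value estimates
    V_new = (pi * Q).sum(axis=1)
    if np.max(np.abs(V_new - V)) < 1e-8:
        break
    V = V_new
print(V)
```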

These two equations are the backbone for many algorithms that were built later and they are also an important theoretical relationship for the foundations of reinforcement learning.

Why? Great question!

Importance of Bellman Equations

The importance of the Bellman Equations comes down to a few patterns that show up across all sorts of other RL algorithms.

  1. Bootstrapping

    Bootstrapping lets you update the intermediate State Values or Action Values as your algorithm takes each step along the trajectory, instead of only at the end of the rollout. Our Bellman equations create a relationship between the preceding and succeeding steps, which is what enables this (see the TD-style sketch after this list).

  2. Structured Credit Assignment

    The Bellman equations shape the way in which reward/credit is assigned to each action leading up to a terminal state. Namely, the equations value each step by a discounted expected average reward, so the discount introduces a near-term bias at each stage that balances smaller but nearer-term rewards against larger but longer-horizon rewards.

  3. Implicit Policies

    After training our value functions to optimality, we are able to use the Q-Value Bellman equation to implicitly sample from the optimal policy: as long as we pick the action with the highest Action Value, we are guaranteed the maximum expected reward, i.e. $a_{t+1} = \arg\max_{a_{t+1}} \left[ Q^*(s_{t+1}, a_{t+1}) \right]$

Note that the Q value is what makes implicit policies practical: using the State Value Function to predict the best next action would require the transition probabilities and the reward function, which are unavailable in most environments.
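
As a concrete example of the bootstrapping promised under point 1, here is a sketch of a TD(0)-style update: after every single transition, V(s) is nudged towards r + γ·V(s') using the current estimate of the next state's value, rather than waiting for the end of the episode.

```python
import numpy as np

n_states, alpha, gamma = 5, 0.1, 0.9
V = np.zeros(n_states)                    # tabular value estimates, initialised to zero

def td0_update(s, r, s_next, done):
    """One bootstrapped update after a single (s, r, s_next) transition."""
    target = r + (0.0 if done else gamma * V[s_next])   # uses the *estimate* of the next state
    V[s] += alpha * (target - V[s])                     # move V(s) towards the Bellman target

# Example: a single made-up transition from state 2 to state 3 with reward 1.0.
td0_update(s=2, r=1.0, s_next=3, done=False)
print(V)
```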

Aside from the above 3 points, the Bellman equations also serve as the backbone of many value-based, off-policy algorithms, i.e. algorithms that do not attempt to optimise a policy model directly. That road is one we will travel on a different day.

#AI #Math #RL