Until recently, a true understanding of MDPs had eluded me despite my great interest in Reinforcement Learning. And as I like to do, I'll be sharing this newly acquired knowledge in the most understandable way possible, i.e. without any scary formulas (for now). So grab a pizza and let's roll.
A Markov Decision Process (MDP) is a mathematical representation of the environment an agent learns in. The MDP is made up of 4 parts:
- States
- Actions
- Transition Function
- Reward
This is what the MDP looks like:
<S, A, T, R>
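If you like to think in code, here is a minimal sketch of how the four parts could be bundled together in Python. The class and field names are my own illustration, not a standard library API:

```python
from dataclasses import dataclass
from typing import Callable, Set

# A bare-bones container mirroring the <S, A, T, R> tuple.
# The field names are illustrative, not a standard API.
@dataclass
class MDP:
    states: Set[str]                              # S: the set of states
    actions: Set[str]                             # A: the set of actions
    transition: Callable[[str, str, str], float]  # T(s, a, s') = P(s' | s, a)
    reward: Callable[[str, str], float]           # R(s, s')
```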
States
A state represents the current situation of the environment from the perspective of the agent. In a chess game, for example, it is the configuration of the pieces on the board. It's important to note that in the real world, the set of states S is not always finite. Also, not all states are directly reachable from every other state. You can think of a state as a node in a directed graph data structure.
Actions
This one is almost self-explanatory. An action is how an agent gets from one state s to another state s'. In the chess game, one move by a player (an action) changes the configuration of the board (the state). So the second object A in the MDP tuple is the set of all actions the agent can take in the world. In chess, this translates to things like moving the queen to b4 or a knight to c2. Just like states, not every action can be taken from every state. For example, you can't move a rook at the beginning of the game. You can think of an action as an edge in a directed graph data structure, as sketched below.
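To make the graph picture concrete, here is a tiny sketch in Python. The three-state world and its action names are completely made up for illustration:

```python
# A made-up three-state world: each state (node) maps to the actions
# (outgoing edges) that are legal from it, and each action to the state it leads to.
world = {
    "start":      {"go_left": "left_room", "go_right": "right_room"},
    "left_room":  {"go_back": "start"},
    "right_room": {},  # terminal: no actions available here
}

def available_actions(state):
    """Actions the agent may take from this state (not every action is legal everywhere)."""
    return list(world[state].keys())

print(available_actions("start"))       # ['go_left', 'go_right']
print(available_actions("right_room"))  # []
```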
Transition Function
Think about how things don't always go as planned. Hurts, right? Yep!!! In an agent's world, this can also be the case. The transition function gives the probability that an agent gets to the next state given that she takes an action from her current state. It is written like this: T(s,a,s'). In essence, this is P(s'|s,a): the probability of ending up in state s' given that we take action a in state s. In a deterministic world, T(s,a,s') equals 1 for the state we expect to land in (and 0 for every other state), because we're 100% sure of where an action takes us. We can configure this when modelling the environment.
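Continuing the toy world from above, a stochastic transition function could be sketched like this. The probabilities are invented for illustration; in a deterministic world each (s, a) pair would map to a single next state with probability 1:

```python
# T(s, a, s') = P(s' | s, a), stored as nested dictionaries.
# The probabilities are invented; for each (s, a) pair they sum to 1.
T = {
    ("start", "go_left"):     {"left_room": 0.8, "start": 0.2},  # slips back 20% of the time
    ("start", "go_right"):    {"right_room": 1.0},               # deterministic
    ("left_room", "go_back"): {"start": 1.0},
}

def transition(s, a, s_next):
    """Probability of landing in s_next after taking action a in state s."""
    return T.get((s, a), {}).get(s_next, 0.0)

print(transition("start", "go_left", "left_room"))  # 0.8
```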
Rewards
We use some notion of reward to help the agent take the next best step. The reward function rewards an agent for taking the right actions and punishes it (with negative rewards) for wrong ones. This can also be configured while modelling the environment. It's usually written this way:
R(s,s’): The reward for moving from state s to state s’.
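In the same toy world, a reward function in the R(s, s') form might be sketched like this. The numbers are arbitrary choices of the kind you'd make while modelling the environment:

```python
# R(s, s'): reward for moving from state s to state s'.
# The values are arbitrary design choices made while modelling the environment.
rewards = {
    ("start", "right_room"): 10.0,  # reaching the goal is rewarded
    ("start", "left_room"):  -1.0,  # a small penalty for a detour
    ("left_room", "start"):  -1.0,
}

def reward(s, s_next):
    """Reward for the transition s -> s'; unlisted transitions earn nothing."""
    return rewards.get((s, s_next), 0.0)

print(reward("start", "right_room"))  # 10.0
```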
One more thing
The goal when solving an MDP is to find a policy, π(s). The policy specifies the action to be taken by the agent when in a state s.
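In its simplest (deterministic) form, a policy is just a lookup from state to action. Here is a sketch for the toy world above; this particular policy is hand-written, not an optimal one:

```python
# pi(s): the action the agent commits to in each state.
# This policy is hand-written for illustration, not the result of any optimization.
policy = {
    "start":     "go_right",
    "left_room": "go_back",
}

def pi(state):
    """Return the action the policy picks in the given state."""
    return policy[state]

print(pi("start"))  # 'go_right'
```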
Also, the idea behind the MDP is that decisions about the future can be made based on the current state alone. In other words, to decide what action to take next, we only have to consider the environment at time t, not the entire history of how we got there. This is known as the Markov property, and we'll get to it in a subsequent article.
Now that we have laid the foundation, we can go into the mathematics. In a later article, we'll talk about finding π*, the optimal policy. We'll also talk about the Q-function (the action-value function) and MDP optimization.