Unit 4
There are many different algorithms that tackle this issue. As a matter of
fact, Reinforcement Learning is defined by a specific type of problem, and all
of its solutions are classified as Reinforcement Learning algorithms. In this
problem, an agent must decide the best action to take based on its current
state. When this step is repeated, the problem is known as a Markov Decision
Process.
A Markov Decision Process (MDP) model contains:
A set of possible world states S.
A set of Models.
A set of possible actions A.
A real-valued reward function R(s, a).
A policy, the solution of the Markov Decision Process.
State
The set of States is a set of tokens, one for every state that the agent can
be in.
Model
A Model (sometimes called Transition Model) gives an action’s effect in a
state. In particular, T(S, a, S’) defines a transition T where being in state S
and taking an action ‘a’ takes us to state S’ (S and S’ may be the same). For
stochastic actions (noisy, non-deterministic) we also define a probability P(S’|
S,a) which represents the probability of reaching a state S’ if action ‘a’ is
taken in state S. Note that the Markov property states that the effects of an
action taken in a state depend only on that state and not on the prior history.
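As an illustration, here is a minimal sketch of a stochastic transition model in Python. The two states, the actions, and the probabilities are invented for the example and are not taken from the text; they only show one way to represent T(S, a, S') and P(S' | S, a).

```python
# A minimal sketch of a stochastic transition model P(S' | S, a),
# using a hypothetical two-state example (states, actions and
# probabilities are illustrative only).
T = {
    # (state, action) -> {next_state: probability}
    ("S1", "go"):   {"S2": 0.8, "S1": 0.2},   # noisy move: may stay put
    ("S1", "stay"): {"S1": 1.0},
    ("S2", "go"):   {"S1": 0.8, "S2": 0.2},
    ("S2", "stay"): {"S2": 1.0},
}

def transition_prob(s, a, s_next):
    """Return P(s_next | s, a); 0.0 if the transition is impossible."""
    return T.get((s, a), {}).get(s_next, 0.0)

# The Markov property: the distribution over s_next depends only on (s, a),
# not on how the agent arrived at s.
print(transition_prob("S1", "go", "S2"))  # 0.8
```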
Actions
A is the set of all possible actions. A(s) defines the set of actions that
can be taken while in state S.
Reward
A Reward is a real-valued reward function. R(s) indicates the reward for
simply being in state S; R(S, a) indicates the reward for being in state S and
taking action 'a'; R(S, a, S') indicates the reward for being in state S, taking
action 'a', and ending up in state S'.
Policy
A Policy is a solution to the Markov Decision Process. It is a mapping from
the set of states to the set of actions, indicating the action 'a' to be taken
while in state S.
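A policy can be represented very directly as a lookup table from states to actions. The small sketch below reuses the hypothetical two-state example; the state names and actions are illustrative.

```python
# A minimal sketch of a policy: a mapping from each state to the action
# the agent should take there (states and actions are illustrative).
policy = {
    "S1": "go",
    "S2": "stay",
}

def act(state):
    """Return the action prescribed by the policy for the given state."""
    return policy[state]

print(act("S1"))  # go
```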
An agent lives in a grid. The example is a 3×4 grid. The grid has a START
state (grid no 1,1). The purpose of the agent is to wander around the grid and
finally reach the Blue Diamond (grid no 4,3). Under all circumstances, the
agent should avoid the Fire grid (orange color, grid no 4,2). Also, grid no 2,2
is a blocked grid; it acts as a wall, so the agent cannot enter it.
The agent can take any one of these actions: UP, DOWN, LEFT, RIGHT.
Walls block the agent's path, i.e., if there is a wall in the direction the agent
would have moved, the agent stays in the same place. For example, if the
agent takes LEFT in the START grid, it stays put in the START grid.
First Aim: To find the shortest sequence getting from START to the Diamond.
Two such sequences can be found: UP UP RIGHT RIGHT RIGHT and
RIGHT RIGHT UP UP RIGHT (both avoid the wall at 2,2 and the Fire grid at 4,2).
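This can be checked mechanically. The sketch below encodes the grid from the example (START at 1,1, the Diamond at 4,3, the wall at 2,2, the Fire cell at 4,2, with coordinates written column-first) and, assuming deterministic moves, uses breadth-first search to enumerate every shortest action sequence; it finds exactly the two sequences listed above.

```python
from collections import deque

# Grid from the example: 4 columns x 3 rows, coordinates written (col, row).
START, GOAL = (1, 1), (4, 3)   # START cell and the Blue Diamond
WALL, FIRE = (2, 2), (4, 2)    # blocked cell and the Fire cell to avoid

MOVES = {"UP": (0, 1), "DOWN": (0, -1), "LEFT": (-1, 0), "RIGHT": (1, 0)}

def step(state, action):
    """Deterministic move; bumping into the wall or the edge leaves the agent in place."""
    x, y = state
    dx, dy = MOVES[action]
    nxt = (x + dx, y + dy)
    if not (1 <= nxt[0] <= 4 and 1 <= nxt[1] <= 3) or nxt == WALL:
        return state
    return nxt

def shortest_sequences():
    """Breadth-first search for every shortest action sequence from START to GOAL."""
    best, found = None, []
    queue = deque([(START, [])])
    while queue:
        state, path = queue.popleft()
        if best is not None and len(path) > best:
            break                      # all shortest sequences already found
        if state == GOAL:
            best = len(path)
            found.append(path)
            continue
        for action in MOVES:
            nxt = step(state, action)
            if nxt != state and nxt != FIRE:   # skip wasted moves and the Fire cell
                queue.append((nxt, path + [action]))
    return found

print(shortest_sequences())
# -> two length-5 sequences: UP UP RIGHT RIGHT RIGHT and RIGHT RIGHT UP UP RIGHT
```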
Utility theory
Utility theory is a fundamental concept in economics and decision theory. It
provides a framework for understanding how individuals make choices under
uncertainty. The idea is that people assign a value (a utility) to each possible
outcome of a choice, expressing how much they like or are satisfied with that
outcome. The aim is to obtain the highest expected utility, which is the
average of the values of all possible outcomes, weighted by how likely each
one is to happen. For our agent, this means the aim is not only to reach the
goal but to reach it in the best possible way.
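As a small numerical illustration, the sketch below computes the expected utility of two hypothetical choices; the outcome probabilities and utility values are made up for the example.

```python
# Expected utility: each choice has possible outcomes, each outcome a
# probability and a utility value (all numbers are illustrative).
choices = {
    "safe_route":  [(1.0, 5.0)],                  # certain, modest utility
    "risky_route": [(0.8, 10.0), (0.2, -20.0)],   # high reward but may fail badly
}

def expected_utility(outcomes):
    """Probability-weighted average of the outcome utilities."""
    return sum(p * u for p, u in outcomes)

for name, outcomes in choices.items():
    print(name, expected_utility(outcomes))       # safe_route 5.0, risky_route 4.0

best = max(choices, key=lambda c: expected_utility(choices[c]))
print("best choice:", best)                       # safe_route
```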
Components of a POMDP
A Partially Observable Markov Decision Process (POMDP) generalises the MDP
to the case where the agent cannot observe its state directly. A POMDP is
formally defined by the following elements:
A set of states S.
A set of actions A.
A transition model T(s, a, s') = P(s' | s, a), the probability of reaching state
s' after taking action 'a' in state s.
A reward function R(s, a).
A set of observations Ω.
An observation model O(o | s', a), the probability of observing 'o' after taking
action 'a' and ending up in state s'.
A discount factor γ.
The key challenge in a POMDP is that the agent does not know its exact state
but has a belief or probability distribution over the possible states. This belief
is updated using Bayes' rule as new observations are made, forming a
belief update rule:

Bel'(s') = η · O(o | s', a) · Σ_s T(s, a, s') · Bel(s)

Where:
Bel(s) is the prior belief of being in state s.
Bel'(s') is the updated belief of being in state s' after taking action 'a' and
observing 'o'.
T(s, a, s') is the transition probability P(s' | s, a).
O(o | s', a) is the probability of observing 'o' after taking action 'a' and
ending up in state s'.
η is a normalising constant that makes the updated belief sum to 1.
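The belief update can be implemented in a few lines. The sketch below uses a hypothetical two-state POMDP; the transition model, observation model, and all probabilities are invented for illustration, and the observation model is simplified to depend only on the new state s'.

```python
# Bayes belief update for a tiny two-state POMDP (all numbers illustrative).
STATES = ["s1", "s2"]

# Transition model T(s, a, s') = P(s' | s, a)
T = {
    ("s1", "move"): {"s1": 0.3, "s2": 0.7},
    ("s2", "move"): {"s1": 0.7, "s2": 0.3},
}

# Observation model: probability of observing o in the new state s'
# (simplified to ignore the action).
O = {
    ("bright", "s1"): 0.9, ("dark", "s1"): 0.1,
    ("bright", "s2"): 0.2, ("dark", "s2"): 0.8,
}

def belief_update(bel, action, observation):
    """Bel'(s') = eta * O(o | s') * sum_s T(s, a, s') * Bel(s)."""
    new_bel = {}
    for s_next in STATES:
        predicted = sum(T[(s, action)][s_next] * bel[s] for s in STATES)
        new_bel[s_next] = O[(observation, s_next)] * predicted
    eta = 1.0 / sum(new_bel.values())          # normalising constant
    return {s: p * eta for s, p in new_bel.items()}

bel = {"s1": 0.5, "s2": 0.5}                   # prior belief
bel = belief_update(bel, "move", "dark")
print(bel)                                      # belief now strongly favours s2
```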
Solving Techniques: