Unit I
Reinforcement Learning
RL solves a specific type of problem in which decision making is sequential and the
goal is long-term, such as game playing, robotics, etc.
It is a core part of artificial intelligence. Here we do not need to pre-program the
agent; instead, it learns from its own experience without any human intervention.
Example: Suppose an AI agent is present within a maze environment, and its goal
is to find the diamond. The agent interacts with the environment by performing
actions; based on those actions, the state of the agent changes, and the agent also
receives a reward or penalty as feedback.
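This interaction can be pictured as a simple loop between the agent and the environment. The sketch below (in Python) illustrates that loop for the maze example; the MazeEnvironment class, the grid size and the reward values are illustrative assumptions made here, not part of any standard library.

    import random

    # Minimal sketch of the agent-environment interaction loop for the maze example.
    ACTIONS = ["up", "down", "left", "right"]

    class MazeEnvironment:
        def __init__(self):
            self.agent_pos = (0, 0)      # starting cell
            self.diamond_pos = (2, 2)    # goal cell

        def step(self, action):
            # Move the agent, keeping it inside a 3x3 grid.
            moves = {"up": (-1, 0), "down": (1, 0), "left": (0, -1), "right": (0, 1)}
            dr, dc = moves[action]
            r = min(max(self.agent_pos[0] + dr, 0), 2)
            c = min(max(self.agent_pos[1] + dc, 0), 2)
            self.agent_pos = (r, c)
            if self.agent_pos == self.diamond_pos:
                return self.agent_pos, 10, True    # reward: the diamond is found
            return self.agent_pos, -1, False       # small penalty for every other move

    env = MazeEnvironment()
    done = False
    while not done:
        action = random.choice(ACTIONS)          # the agent tries an action on its own
        state, reward, done = env.step(action)   # the state changes and feedback arrives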
Key Features of Reinforcement Learning
Reinforcement problems are closed-loop problems because the learning system's
actions influence its later inputs.
The agent is not told which actions to take; instead, it must discover on its own which
actions yield the most reward by trying them out.
Actions may affect not only the immediate reward but also the next state and, through
it, all subsequent rewards.
The agent may get a delayed reward.
The environment is stochastic, and the agent needs to explore it to get the
maximum positive rewards.
How is reinforcement learning different from supervised and unsupervised learning?
Reinforcement learning is different from supervised learning. In supervised learning,
the agent is told what action it has to take in a particular situation, whereas in
reinforcement learning the agent is not told what action to take. It is the
responsibility of the agent to decide on its own what action to take in each
situation.
Reinforcement learning is also different from unsupervised learning. The goal of
unsupervised learning is to extract hidden patterns or structure from
unlabelled data. In reinforcement learning, instead of finding hidden
structure, we focus on maximizing the reward.
So, reinforcement learning is different from both supervised and unsupervised learning, and we
consider it a third paradigm in the space of learning and computational intelligence.
One of the challenges that arise in reinforcement learning is the trade-off between
exploration and exploitation.
Exploitation - is a strategy of using the accumulated knowledge to choose the action that
currently looks best, i.e., the one with the highest expected reward.
Exploration - is a strategy of trying new actions to acquire information that may lead to
higher rewards in the future.
Let's suppose persons A and B are digging in a coal mine in the hope of finding a diamond
inside it. Person B succeeds in finding a diamond before person A and walks off
happily. After seeing this, person A gets a bit greedy and thinks he too might succeed
in finding a diamond at the same place where person B was digging. This action
performed by person A is called a greedy action, and this policy is known as a greedy
policy. But person A did not know that a bigger diamond was buried in the place
where he was initially digging, so the greedy policy would fail in this situation.
In this example, person A only had knowledge of the place where person B was digging
but had no knowledge of what lies beyond that depth. In the actual scenario, the
diamond could be buried in the same place where he was digging initially or in some
completely different place. Hence, with this partial knowledge about getting more rewards,
our reinforcement learning agent will be in a dilemma: should it exploit the partial
knowledge to receive some rewards, or should it explore unknown actions which could
result in higher rewards?
Restaurant Selection
Exploitation: Go to your favorite restaurant
Exploration: Try a new restaurant
Oil Drilling
Exploitation: Drill at the best known location
Exploration: Drill at a new location
Game Playing
Exploitation: Play the move you believe is best
Exploration: Play an experimental move
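One common way to balance this trade-off is an epsilon-greedy rule: exploit the best-known action most of the time, but explore a random action with a small probability epsilon. The following sketch illustrates the idea; the action-value estimates used are made-up numbers for illustration only.

    import random

    def epsilon_greedy(action_values, epsilon=0.1):
        # With probability epsilon, explore: pick any action at random.
        if random.random() < epsilon:
            return random.randrange(len(action_values))
        # Otherwise, exploit: pick the action with the highest estimated value.
        return max(range(len(action_values)), key=lambda a: action_values[a])

    estimates = [2.0, 5.0, 1.0]          # e.g., estimated payoff of three digging spots
    chosen = epsilon_greedy(estimates)   # usually action 1 (the best spot), occasionally a random one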
Robotics
Robotics is an important sector that uses RL extensively. Many robots are
trained to grasp objects.
For example, a vacuum cleaner robot automatically cleans rooms without
human intervention.
Similarly, a robotic arm in a manufacturing plant can automatically fit various
parts of a car.
2. Reward Signal
A reward signal defines the goal of a reinforcement learning problem.
The reward signal defines which events are good and which are bad for the agent.
The objective of the agent is to maximize the total reward received over the long
run.
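The "total reward over the long run" is commonly formalized as a (discounted) return, i.e., the sum of all future rewards. The small sketch below illustrates the idea with a made-up reward sequence and a hypothetical discount factor gamma.

    # Sketch: discounted return G_t = r_t + gamma * G_{t+1}, computed backwards.
    def discounted_return(rewards, gamma=0.9):
        g = 0.0
        for r in reversed(rewards):
            g = r + gamma * g
        return g

    # A delayed reward still contributes to the return of the first step.
    print(discounted_return([0, 0, 0, 10]))   # 0.9**3 * 10 = 7.29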
3. Value Function
The value of a state is the total amount of reward an agent can expect to
accumulate in the future starting from that state.
The reward signal indicates the immediate response, whereas the value indicates the long-
term response.
The value function indicates what is good in the long term.
A state may have a low immediate reward but still have a high value. How is this
possible? It is because the state may be followed by states that yield high rewards.
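As an illustration, the tiny sketch below compares two made-up states: one with a low immediate reward that leads to a highly rewarding state, and one with a higher immediate reward that leads nowhere. The numbers and the discount factor gamma are chosen purely for illustration.

    gamma = 0.9
    value_A = 0 + gamma * 10   # low immediate reward, but high value (9.0)
    value_B = 1 + gamma * 0    # higher immediate reward, but low value (1.0)
    print(value_A > value_B)   # True: A is the better state in the long run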
If the board fills up with neither player getting three in a row, the game is a draw.
Although this is a simple problem, it cannot readily be solved in a satisfactory way through
classical techniques. Reinforcement learning solves this problem by estimating the
values of states. The following figure shows the state-space tree of the tic-tac-toe game, in
which each node represents a state.
Reinforcement learning uses a value function to estimate the values of states. First we
set up a table of numbers, one for each possible state of the game. Each number will be
the latest estimate of the probability of winning from that state. We treat this estimate
as the state's value, and the whole table is the learned value function. If state A has a higher
value than state B, then the probability of winning from A is higher than from B.
In the tic-tac-toe game, we initially know only the values of the terminal states, as shown below.
(Figure: terminal states of tic-tac-toe and their known values.)
For all other states, the values can be estimated using a value function, as given below.
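A minimal sketch of how such a table of values might be represented and updated, assuming (as in the classic treatment of this example) that won terminal states get value 1, lost or drawn ones get value 0, every other state starts at the neutral guess 0.5, and the estimates are nudged by a simple temporal-difference style step with a small step size alpha:

    # Tabular value function for tic-tac-toe.
    # A state is a string of 9 cells ('X', 'O' or '-'); its value estimates P(win).
    values = {}

    def get_value(state, terminal_result=None):
        if terminal_result == "win":
            values[state] = 1.0
        elif terminal_result in ("loss", "draw"):
            values[state] = 0.0
        return values.setdefault(state, 0.5)   # non-terminal states start at 0.5

    def td_update(state, next_state, alpha=0.1):
        # Move the earlier state's value a little toward the later state's value.
        v, v_next = get_value(state), get_value(next_state)
        values[state] = v + alpha * (v_next - v)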