RL Unit 1
UNIT-I
Basics of probability and linear algebra, Definition of a stochastic
multi-armed bandit, Definition of regret, achieving sublinear regret,
UCB algorithm, KL-UCB, Thompson Sampling
Reinforcement Learning:
One key feature of reinforcement learning is the use of a reward signal, which
provides feedback to the agent based on the actions it takes. The agent's goal is
to learn to select actions that maximize long-term cumulative rewards, rather
than optimizing for immediate rewards.
Example:
Suppose there is an AI agent inside a maze environment, and its
goal is to find the diamond. The agent interacts with the
environment by performing actions, and based on those actions,
the state of the agent changes; it also receives a reward or
penalty as feedback.
The agent keeps repeating these three steps (take an action,
change state or remain in the same state, and receive feedback),
and by doing so it learns and explores the environment.
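This interaction loop can be written down in a few lines of Python. The GridMaze class, its reset/step methods, and the reward values below are made-up placeholders for illustration, not part of any specific library.
import random

class GridMaze:
    """Toy maze: positions 0..4 on a line, with the diamond at position 4 (illustrative only)."""
    def reset(self):
        self.state = 0
        return self.state

    def step(self, action):
        # action: -1 = move left, +1 = move right
        self.state = max(0, min(4, self.state + action))
        reward = 1.0 if self.state == 4 else -0.1   # reward for reaching the diamond, small penalty otherwise
        done = (self.state == 4)
        return self.state, reward, done

env = GridMaze()
state = env.reset()
done = False
while not done:
    action = random.choice([-1, 1])            # the agent takes an action (here chosen at random)
    state, reward, done = env.step(action)     # the environment returns the new state and a reward (feedback)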
Applications of Reinforcement learning:
It has applications in various domains, including:
Robotics navigation
Game playing
Autonomous vehicles
Finance
Healthcare
Marketing strategy control
Webpage indexing, and more
It has been successfully used to train agents that can play complex
games, control robotic systems, optimize resource allocation, and
make decisions in uncertain and dynamic environments.
(q) State key constituents of reinforcement learning. (Explain key terms in
reinforcement learning.)
Agent: An entity that can explore the environment and act upon it.
Environment: The situation in which the agent is present or by which
it is surrounded. In RL, we assume a stochastic environment, which
means it is random in nature.
Action: Actions are the moves taken by the agent within the environment.
State: The situation returned by the environment after each action
taken by the agent.
Reward: Feedback returned to the agent from the environment to
evaluate the agent's action.
In RL, the agent is not instructed about the environment or which
actions need to be taken.
It is based on a trial-and-error process.
The agent takes the next action and changes state according to the
feedback from the previous action.
The agent may get a delayed reward.
The environment is stochastic, and the agent needs to explore it in
order to obtain the maximum positive reward.
1. Policy (Rule):
The policy is the core of a reinforcement learning agent; it alone
is sufficient to determine behavior.
A policy is the agent's behavior function.
It defines which action the agent takes in a given situation.
A policy is a function that maps the agent's current state to an action.
In general, policies may be deterministic, or stochastic, specifying
a probability for each action.
For a deterministic policy: a = π(s)
For a stochastic policy: π(a | s) = P[A_t = a | S_t = s]
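As a rough sketch of the two kinds of policy, the snippet below represents a deterministic policy as a plain state-to-action mapping and a stochastic policy as a per-state probability distribution over actions; the state and action names are invented for illustration.
import random

# Deterministic policy: a = pi(s), exactly one action per state (example mapping).
deterministic_pi = {"s1": "left", "s2": "right", "s3": "left"}

def act_deterministic(state):
    return deterministic_pi[state]

# Stochastic policy: pi(a | s) = P[A_t = a | S_t = s], a distribution over actions for each state.
stochastic_pi = {
    "s1": {"left": 0.8, "right": 0.2},
    "s2": {"left": 0.3, "right": 0.7},
}

def act_stochastic(state):
    actions = list(stochastic_pi[state].keys())
    probs = list(stochastic_pi[state].values())
    return random.choices(actions, weights=probs, k=1)[0]

print(act_deterministic("s1"))   # always 'left'
print(act_stochastic("s1"))      # 'left' with probability 0.8, 'right' with probability 0.2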
2. Reward signal:
A reward signal defines the goal of a reinforcement learning
problem.
A reward is a numerical value sent by the environment to the
reinforcement learning agent every time it performs an
action (feedback).
The goal of the RL agent is to maximize the total reward it receives
over the long run.
The reward signal is the primary basis for altering the policy; if
an action selected by the policy is followed by a low reward, then
the policy may be changed to select some other action in that
situation in the future, i.e. the policy changes its behavior based
upon the reward signal.
In general, reward signals may be stochastic functions of the
state of the environment and the actions taken.
These are some of the basic concepts where probability comes into
play in reinforcement learning. Understanding and effectively using
probabilities is essential for agents to learn optimal policies and make
informed decisions in uncertain environments.
Stochastic Multi-Armed Bandit:
In a stochastic multi-armed bandit (the k-armed bandit problem), the
agent repeatedly chooses among k actions (arms); after each choice it
receives a numerical reward drawn from a fixed probability
distribution that depends on the chosen action. The value of an
action a, denoted q*(a), is its expected reward.
If you knew the value of each action, then it would be trivial to solve
the k-armed bandit problem: you would always select the action with the
highest value.
We denote the estimated value of action a at time step t as
Q_t(a). We would like Q_t(a) to be close to q*(a).
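To make the setting concrete, here is a minimal sketch of a stochastic k-armed bandit in which each arm's true value q*(a) is the mean of a Gaussian reward distribution; the number of arms and the Gaussian assumption are choices made here for illustration.
import random

K = 10  # number of arms (chosen here for illustration)

# True action values q*(a): fixed by the environment, unknown to the agent.
q_star = [random.gauss(0.0, 1.0) for _ in range(K)]

def pull(a):
    """Reward for pulling arm a: Gaussian with mean q*(a) and unit variance (an assumption)."""
    return random.gauss(q_star[a], 1.0)

# If q*(a) were known, the problem would be trivial: always pull the best arm.
best_arm = max(range(K), key=lambda a: q_star[a])
print(best_arm, q_star[best_arm])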
Action-value Methods:
Methods for estimating the values of actions and for using those
estimates to make action-selection decisions are collectively called
action-value methods.
One natural way to estimate the value of an action is to average the
rewards actually received:
Q_t(a) = (sum of rewards received when a was taken prior to t) / (number of times a was taken prior to t)
The simplest action-selection rule is to select the greedy action,
A_t = argmax_a Q_t(a),
where argmax_a denotes the action a for which the expression that
follows is maximized.
Greedy action selection always exploits current knowledge to
maximize immediate reward.
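A minimal sketch of an action-value method on the same kind of Gaussian bandit: Q_t(a) is estimated by the sample average of observed rewards (in incremental form), and actions are selected greedily, with a small probability ε of exploring a random arm (the ε-greedy variant; ε = 0.1 is a common choice, not a value from the text).
import random

K = 10
q_star = [random.gauss(0.0, 1.0) for _ in range(K)]    # true values q*(a), hidden from the agent
def pull(a):
    return random.gauss(q_star[a], 1.0)                 # sampled reward for arm a

steps = 1000
epsilon = 0.1                  # exploration probability (assumed value)
Q = [0.0] * K                  # sample-average estimates Q_t(a)
N = [0] * K                    # number of times each arm has been pulled

for t in range(steps):
    if random.random() < epsilon:
        a = random.randrange(K)                        # explore: pick a random arm
    else:
        a = max(range(K), key=lambda i: Q[i])          # exploit: greedy action, argmax_a Q_t(a)
    r = pull(a)
    N[a] += 1
    Q[a] += (r - Q[a]) / N[a]                          # incremental update of the sample average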
Definition of regret:
In the context of decision-making and reinforcement learning, regret
is a concept that measures the performance of an agent or algorithm.
It is the gap between the reward that would have been collected by
always choosing the best action and the reward the agent actually
collects. After T time steps, the expected regret is
R_T = T · μ* − E[ Σ_{t=1}^{T} r_t ]
where:
μ* = max_a q*(a) is the expected reward of the best action (arm), and
r_t is the reward received at time step t.
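As a rough numerical illustration of this definition (not the formal expected regret), the sketch below runs a uniformly random policy on a Gaussian bandit and compares the reward it collects with what the best arm would give on average; all constants are made up for the example.
import random

K = 5
q_star = [random.gauss(0.0, 1.0) for _ in range(K)]    # true mean rewards of the arms
mu_star = max(q_star)                                   # mean reward of the best arm

T = 1000
total_reward = 0.0
for t in range(T):
    a = random.randrange(K)                             # uniformly random policy, purely for illustration
    total_reward += random.gauss(q_star[a], 1.0)        # sampled reward

empirical_regret = T * mu_star - total_reward           # empirical counterpart of T·μ* − E[Σ r_t]
print(empirical_regret)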