• Reinforcement learning is the idea that a machine can teach itself from the results of its own actions. Without further delay, let's start.

How does reinforcement learning work?
1. Start in a state.
2. Take an action.
3. Receive a reward or penalty from the environment.
4. Observe the new state of the environment.
5. Update your policy to maximize future rewards.

Terminology
• Agent – the sole decision-maker and learner.
• Environment – the world (physical or simulated) in which the agent learns and decides which actions to perform.
• Action – one of the moves the agent can perform; the full list forms the action set.
• State – the current situation of the agent in the environment.
• Reward – for each action the agent selects, the environment returns a reward. It is usually a scalar value and is simply feedback from the environment.
• Policy – the agent's strategy (decision-making rule) for mapping situations to actions.
• Value Function – the total reward the agent can expect to collect starting from a state and following the policy from then on.
• Model – not every RL agent uses a model of its environment. When one is used, it maps state-action pairs to probability distributions over the next states.

Elements of reinforcement learning (Policy)
• Policy: A policy defines how an agent behaves at a given time. It maps the perceived states of the environment to the actions taken in those states. A policy is the core element of RL, as it alone can define the behavior of the agent. It may be a simple function or a lookup table, and it can be deterministic or stochastic:
• For a deterministic policy: a = π(s)
• For a stochastic policy: π(a | s) = P[A_t = a | S_t = s]

Elements of reinforcement learning (Reward Signal)
• Reward Signal: The goal of reinforcement learning is defined by the reward signal. At each state, the environment sends an immediate signal to the learning agent, and this signal is known as a reward signal.
These rewards are given according to the good and bad actions taken by the agent. The agent's main objective is to maximize the total reward it collects. The reward signal can change the policy: if an action selected by the agent leads to a low reward, the policy may change to select other actions in the future.

Elements of reinforcement learning (Value Function)
• Value Function: The value function tells the agent how good a state and action are and how much reward it can expect. A reward is the immediate signal for each good or bad action, whereas a value function identifies which states and actions are good in the long run.
• The value function depends on the reward: without reward, there could be no value. The goal of estimating values is to achieve more reward.

Elements of reinforcement learning (Model)
• Model: The last element of reinforcement learning is the model, which mimics the behavior of the environment. With the help of the model, one can make inferences about how the environment will behave. For example, given a state and an action, a model can predict the next state and reward.
• The model is used for planning, which means it provides a way to choose a course of action by considering future situations before actually experiencing them. Approaches that solve RL problems with the help of a model are called model-based approaches. By contrast, an approach that does not use a model is called a model-free approach.

Approaches to implement reinforcement learning algorithms
• Value-Based – The main goal of this method is to maximize a value function. Here, the agent estimates the long-term return it can expect from the current state under a policy.
• Policy-Based – In policy-based methods, you come up with a strategy that gains maximum reward in the future through the actions performed in each state.
Two types of policy-based methods are deterministic and stochastic.
• Model-Based – In this method, we create a virtual model of the environment to help the agent learn to perform in that specific environment.

Types of reinforcement learning: Positive Reinforcement
• Positive reinforcement occurs when an event that follows a specific behavior increases the strength and frequency of that behavior. It has a positive impact on behavior.
• Advantages
  • Maximizes the performance of an action
  • Sustains change for a longer period
• Disadvantage
  • Excess reinforcement can lead to an overload of states, which would diminish the results.

Types of reinforcement learning: Negative Reinforcement
• Negative reinforcement is the strengthening of a behavior because a negative condition is stopped or avoided.
• Advantages
  1. Maximizes behavior
  2. Provides a minimum standard of performance
• Disadvantage
  • It does only enough to meet the minimum required behavior.

Widely used models for reinforcement learning
• A Markov Decision Process, or MDP, is used to formalize reinforcement learning problems.
• If the environment is completely observable, then its dynamics can be modeled as a Markov process. In an MDP, the agent constantly interacts with the environment and performs actions; after each action, the environment responds and generates a new state.
• Markov Decision Processes (MDPs) are mathematical frameworks for formulating RL problems and their solutions. An MDP is defined by the following set of parameters:
  • Set of finite states – S
  • Set of possible actions in each state – A
  • Reward – R
  • Model – T
  • Policy – π
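
The MDP parameters listed above (S, A, R, T, π) can be sketched in Python as plain lists and dictionaries. This is only an illustrative sketch: the two-state "cool"/"hot" problem, its probabilities and rewards, and the helper names `step` and `rollout` are assumptions made up for the example and do not come from the slides.

```python
import random

# Hypothetical two-state MDP; every name and number here is illustrative.
STATES = ["cool", "hot"]      # S: finite set of states
ACTIONS = ["slow", "fast"]    # A: actions available in each state

# T: the model. T[(s, a)] maps each next state s' to its probability.
T = {
    ("cool", "slow"): {"cool": 1.0},
    ("cool", "fast"): {"cool": 0.5, "hot": 0.5},
    ("hot", "slow"): {"cool": 0.5, "hot": 0.5},
    ("hot", "fast"): {"hot": 1.0},
}

# R: immediate reward for taking action a in state s.
R = {
    ("cool", "slow"): 1.0,
    ("cool", "fast"): 2.0,
    ("hot", "slow"): 1.0,
    ("hot", "fast"): -10.0,
}

# pi: a deterministic policy, one action per state.
POLICY = {"cool": "fast", "hot": "slow"}


def step(state, action, rng):
    """Sample (next_state, reward) from the current state and action only."""
    distribution = T[(state, action)]
    next_states = list(distribution)
    weights = [distribution[s] for s in next_states]
    next_state = rng.choices(next_states, weights=weights, k=1)[0]
    return next_state, R[(state, action)]


def rollout(start, n_steps, seed=0):
    """Follow POLICY for n_steps; return the visited (s, a, r) triples."""
    rng = random.Random(seed)
    state, history = start, []
    for _ in range(n_steps):
        action = POLICY[state]
        next_state, reward = step(state, action, rng)
        history.append((state, action, reward))
        state = next_state
    return history


if __name__ == "__main__":
    for s, a, r in rollout("cool", 5):
        print(s, a, r)
```

Note that `step` receives only the current state and action, never the history, so the sketch is memoryless by construction.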
The outcome of applying an action in a state does not depend on previous actions or states, only on the current action and state.

Markov Decision Process
• Markov Property:
  • It says that "If the agent is in the current state s1, performs an action a1, and moves to the state s2, then the state transition from s1 to s2 depends only on the current state and action; future states do not depend on past actions, rewards, or states."
• Finite MDP:
  • A finite MDP has finite states, finite rewards, and finite actions. Introductory RL usually considers only finite MDPs.
• Markov Process:
  • A Markov process is a memoryless process with a sequence of random states S1, S2, ..., St that satisfies the Markov property. A Markov process is also known as a Markov chain, which is a tuple (S, P) of a state set S and a transition function P. These two components (S and P) define the dynamics of the system.

Q Learning
• It is a value-based, model-free approach that tells the agent which action it should perform.
• It revolves around updating Q-values, which give the value of doing action a in state s.
• The value update rule is the main aspect of the Q-learning algorithm.
• The main objective of Q-learning is to learn a policy that tells the agent which actions to take under which circumstances to maximize reward.
• Q-learning is an off-policy RL algorithm based on temporal difference learning. Temporal difference learning methods work by comparing temporally successive predictions.
• It learns the value function Q(s, a), which measures how good it is to take action "a" in a particular state "s."
• The goal of the agent in Q-learning is to maximize the value of Q.

Bellman equation
• The values used in Q-learning can be derived from the Bellman equation. Consider the Bellman equation given below.
• The equation was introduced by the mathematician Richard Ernest Bellman in 1953, and hence it is called the Bellman equation.
It is associated with dynamic programming and is used to calculate the value of a decision problem at a certain point in terms of the values of the states that follow.
• The equation has various components, including the reward, the discount factor (γ), the transition probability, and the next state s'.
• The action performed by the agent is referred to as "a."
• The state that occurs when the action is performed is "s."
• The reward/feedback obtained for each good and bad action is "R."
• The discount factor is gamma, "γ."

V(s) = max_a [R(s, a) + γV(s')]

Where:
• V(s) = the value calculated at a particular point (state s).
• R(s, a) = the reward at state s for performing action a.
• γ = the discount factor.
• V(s') = the value of the next state s'.

• The agent has three value options: V(s1), V(s2), V(s3).
• As this is an MDP, the agent only cares about the current state and the future state.
• The agent can go in any direction (Up, Left, or Right), so it needs to decide where to go for the optimal path.
• Here the agent makes its move on a probability basis and changes state. If we want exact moves, we need to reformulate things in terms of the Q-value.
• Q represents the quality of the actions at each state. So instead of using a value at each state, we use a state-action pair, i.e., Q(s, a). The Q-value specifies which action is more lucrative than the others, and the agent takes its next move according to the best Q-value. The Bellman equation can be used to derive the Q-value.
• To perform any action, the agent gets a reward R(s, a) and ends up in a certain state s', so the Q-value equation will be:

Q(s, a) = R(s, a) + γV(s')

Hence, we can say that V(s) = max_a [Q(s, a)]
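
The relation between V and Q above can be checked numerically. The sketch below repeatedly applies the Bellman equation V(s) = max over a of [R(s, a) + γV(s')] until the values stop changing. This technique is value iteration, a dynamic-programming method that needs the model, not Q-learning itself. The three-state chain, the reward of 1 for reaching s3, the value γ = 0.9, and all function names are assumptions chosen for illustration, and transitions are taken to be deterministic.

```python
GAMMA = 0.9  # discount factor (assumed value)

# Hypothetical deterministic chain s1 -> s2 -> s3, with s3 absorbing.
STATES = ["s1", "s2", "s3"]
ACTIONS = ["left", "right"]

# NEXT[(s, a)] is the next state s'; REWARD[(s, a)] is R(s, a).
NEXT = {
    ("s1", "left"): "s1", ("s1", "right"): "s2",
    ("s2", "left"): "s1", ("s2", "right"): "s3",
    ("s3", "left"): "s3", ("s3", "right"): "s3",
}
REWARD = {
    ("s1", "left"): 0.0, ("s1", "right"): 0.0,
    ("s2", "left"): 0.0, ("s2", "right"): 1.0,
    ("s3", "left"): 0.0, ("s3", "right"): 0.0,
}


def q_value(V, s, a):
    """Q(s, a) = R(s, a) + gamma * V(s')."""
    return REWARD[(s, a)] + GAMMA * V[NEXT[(s, a)]]


def value_iteration(tol=1e-9, max_sweeps=1000):
    """Apply V(s) = max_a Q(s, a) to every state until V stops changing."""
    V = {s: 0.0 for s in STATES}
    for _ in range(max_sweeps):
        new_V = {s: max(q_value(V, s, a) for a in ACTIONS) for s in STATES}
        delta = max(abs(new_V[s] - V[s]) for s in STATES)
        V = new_V
        if delta < tol:
            break
    return V


def greedy_policy(V):
    """Pick, in each state, the action with the best Q-value."""
    return {s: max(ACTIONS, key=lambda a: q_value(V, s, a)) for s in STATES}


if __name__ == "__main__":
    V = value_iteration()
    print(V)
    print(greedy_policy(V))
```

In this toy chain the values settle at V(s2) = 1 and V(s1) = γ · 1 = 0.9, so the greedy policy moves right from both s1 and s2, which is the "best Q-value" rule described above.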
State Action Reward State Action (SARSA)
• SARSA stands for State Action Reward State Action. It is an on-policy temporal difference learning method. An on-policy control method selects the action for each state while learning, using a specific policy.
• The goal of SARSA is to calculate Qπ(s, a) for the currently selected policy π and all state-action pairs (s, a).
• The main difference between Q-learning and SARSA is that, unlike Q-learning, SARSA does not need the maximum Q-value of the next state to update the Q-value in the table.
• In SARSA, the new action and reward are selected using the same policy that determined the original action.
• SARSA is so named because it uses the quintuple (s, a, r, s', a'), where:
  • s: original state
  • a: original action
  • r: reward observed after taking action a in state s
  • s' and a': new state-action pair

Deep Q Neural Network (DQN)
• As the name suggests, DQN is Q-learning using neural networks.
• For an environment with a large state space, defining and updating a Q-table is a challenging and complex task.
• To solve this issue, we can use a DQN algorithm in which, instead of defining a Q-table, a neural network approximates the Q-values for each action and state.

Practical Applications of reinforcement learning
• Robotics for industrial automation
• Text summarization engines, dialogue agents (text, speech), and gameplay
• Autonomous self-driving cars
• Machine learning and data processing
• Training systems that issue custom instructions and materials according to the requirements of students
• AI toolkits, manufacturing, automotive, healthcare, and bots
• Aircraft control and robot motion control
• Building artificial intelligence for computer games

Further reading
• https://www.javatpoint.com/reinforcement-learning#Q-Learning