What Is Reinforcement Learning? Definition and Applications

April 23, 2021
by Amal Joby

Reinforcement learning is all about gamifying the learning process.

This machine learning method uses rewards and penalties to teach an AI system. If the system makes the right move, it gets rewarded. If it makes a mistake, it receives a penalty.

In other words, reinforcement learning pushes a system to learn and adapt quickly, or else it forfeits rewards. It's a feedback-based machine learning method in which an AI agent learns to behave in an environment by taking actions and observing the results of those actions.

In short, the agent learns from experience without any pre-programming and doesn't require any human supervision.

It's one of the three basic machine learning types. The other two are supervised and unsupervised learning.

Reinforcement learning lets a machine learn from its mistakes, similar to how humans do. It's a type of machine learning in which the machine learns to solve a problem using trial and error. Also, the machine learns from its actions, unlike supervised learning, where historical data plays a critical role.

The AI system that undergoes the learning process is called the agent or the learner. The learning system explores and observes the environment around it, just like us. If the agent performs the right action, it receives positive feedback or a positive reward. If it takes an adverse action, it receives negative feedback or a negative reward.

Notable characteristics of reinforcement learning (RL) are:

  • Time plays a critical role in RL problems.
  • The agent's decision-making is sequential.
  • There isn't a supervisor, and the agent isn't given any instructions. There are only rewards.
  • The agent's actions directly affect the subsequent data it receives.
  • The agent is rewarded (positive or negative) for each action.
  • The best solution to a problem is decided based on the maximum reward.

The goal of reinforcement learning is to choose the best-known action for any given state. This also means that the actions have to be ranked and assigned values relative to one another. Since the best action depends on the agent’s current state, the focus is more on state-action pairs’ values.

However, reinforcement learning isn't the answer to every situation. For example, if you already have enough data to solve a problem, supervised learning is usually the better fit. Reinforcement learning is also time-consuming and requires a lot of computational resources.

Elements of reinforcement learning

Apart from the agent and the environment, there are four critical elements in reinforcement learning: policy, reward signal, value function, and model.

1. Policy

The policy is the strategy the agent uses to determine the next action based on the current state. It's one of the critical elements of reinforcement learning and can single-handedly define the agent's behavior.

A policy maps the perceived states of the environment to the actions taken on those particular states. It can be deterministic or stochastic and can also be a simple function or a lookup table.
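
To make this concrete, here's a minimal Python sketch of both kinds of policy; the states and actions are hypothetical placeholders, not from any specific library:

```python
import random

# Deterministic policy: a lookup table mapping each state to exactly one action.
deterministic_policy = {
    "sitting": "shake_hands",
    "standing": "sit",
}

# Stochastic policy: each state maps to a probability distribution over actions.
stochastic_policy = {
    "sitting": {"shake_hands": 0.8, "roll_over": 0.2},
    "standing": {"sit": 0.6, "bark": 0.4},
}

def act_deterministic(state):
    return deterministic_policy[state]

def act_stochastic(state):
    actions, probs = zip(*stochastic_policy[state].items())
    return random.choices(actions, weights=probs, k=1)[0]

print(act_deterministic("sitting"))   # always "shake_hands"
print(act_stochastic("sitting"))      # "shake_hands" about 80% of the time
```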

2. Reward signal

At each state, the agent receives an immediate signal from the environment called the reward signal or simply reward. As mentioned earlier, rewards can be positive or negative, depending on the agent's actions. The reward signal can also force the agent to change the policy. For example, if the agent’s actions lead to negative rewards, the agent will be forced to change the policy for the sake of its total reward.

3. Value function

The value function gives information about how favorable specific actions are and how much reward the agent can expect. Simply put, the value function determines how good a state is for the agent to be in. It depends on the agent's policy and the reward, and its goal is to estimate values so the agent can earn more rewards.
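
In textbook notation, the value of a state under a policy π is the expected discounted sum of future rewards, where γ is the discount factor (this is the standard definition, not specific to any one algorithm):

```latex
V^{\pi}(s) = \mathbb{E}_{\pi}\!\left[ \sum_{t=0}^{\infty} \gamma^{t} R_{t+1} \,\middle|\, S_0 = s \right]
```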

4. Model

The model mimics the behavior of the environment. Using a model, you can make inferences about the environment and how it’ll behave. For example, if a state and an action are provided, you can use a model to predict the next state and reward.

Since the model lets you consider all the future situations before experiencing them, you can use it for planning. The approach used for solving reinforcement learning problems with the model’s help is called model-based reinforcement learning. On the other hand, if you try solving RL problems without using a model, it's called model-free reinforcement learning.

While model-based learning tries to choose the optimal policy based on the learned model, model-free learning requires the agent to learn from trial-and-error experience. In general, model-free methods are less sample-efficient than model-based methods.
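
As an illustration, a learned model can be thought of as a function from a state-action pair to a predicted next state and reward. The sketch below assumes a toy dictionary-backed model built from observed transitions; the names are illustrative only:

```python
# A toy deterministic model: maps (state, action) -> (next_state, reward).
# In model-based RL, such a model is learned from experience and then used
# for planning, i.e., simulating outcomes without acting in the real environment.
model = {}

def update_model(state, action, next_state, reward):
    model[(state, action)] = (next_state, reward)

def simulate(state, action):
    """Predict the outcome of an action without touching the real environment."""
    return model.get((state, action))

update_model("sitting", "shake_hands", "shaking_hands", 1.0)
print(simulate("sitting", "shake_hands"))  # ('shaking_hands', 1.0)
```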

Types of reinforcement learning

There are two types of reinforcement learning methods: positive reinforcement and negative reinforcement.

Positive reinforcement

Positive reinforcement is the process of adding or encouraging something when an expected behavior is exhibited, to increase the likelihood of that behavior being repeated.

For example, if a child passes a test with impressive grades, they can be positively reinforced with an ice cream cone.

Negative reinforcement

Negative reinforcement involves increasing the chances of a specific behavior occurring again by removing a negative condition.

For example, if a child fails a test, they can be negatively reinforced by taking away their video games. This is not precisely punishing the child for failing, but removing a negative condition (in this case, video games) that might have caused the kid to fail the test.

How does reinforcement learning work?

Simply put, reinforcement learning is an agent's quest to maximize the reward it receives. There's no human to supervise the learning process, and the agent makes sequential decisions.

Unlike supervised learning, reinforcement learning doesn't require you to label data or correct suboptimal actions. Instead, the goal is to find a balance between exploration and exploitation.

Exploration is when the agent learns by leaving its comfort zone, and doing so might put its reward at stake. Exploration is often challenging and is like entering uncharted territory. Think of it as trying a restaurant you’ve never been to. In the best-case scenario, you might end up discovering a new favorite restaurant and giving your taste buds a treat. In the worst-case scenario, you might end up sick due to improperly cooked food.

Exploitation is when the agent stays in its comfort zone and exploits the currently available knowledge. It's risk-free as there's no chance of attracting a penalty and the agent keeps repeating the same thing. It's like visiting your favorite restaurant every day and not being open to new experiences. Of course, it's a safe choice, but there might be a better restaurant out there.

Reinforcement learning is a trade-off between exploration and exploitation. RL algorithms can be made to both explore and exploit at varying degrees.
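
A common way to strike this balance is an epsilon-greedy rule: with a small probability the agent explores a random action, and otherwise it exploits the action it currently believes is best. A minimal sketch, assuming a Q-table of estimated action values:

```python
import random

def epsilon_greedy(q_values, state, actions, epsilon=0.1):
    """Pick a random action with probability epsilon (explore);
    otherwise pick the highest-valued known action (exploit)."""
    if random.random() < epsilon:
        return random.choice(actions)  # explore
    return max(actions, key=lambda a: q_values.get((state, a), 0.0))  # exploit
```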

Reinforcement learning is an iterative process. The agent starts with no hint about the rewards it can expect from specific state-action pairs. It learns as it goes through these states multiple times and eventually becomes adept. In short, the agent starts as a noob and slowly becomes a pro.

Reinforcement learning example

Since reinforcement learning mirrors how most organisms learn, let's look at how a dog learns new tricks and compare that with this machine learning type.

Charlie is a Golden Retriever. Like other dogs, he doesn't understand English or any human language per se, although he can comprehend intonation and human body language with excellent accuracy.

This means that we can't directly instruct Charlie on what to do, but we can use treats to entice him into doing something. It could be anything as simple as sitting or rolling over on command or shaking hands. For this example, let's consider the “act of shaking hands”.

As you probably know, the rules are pretty simple. If Charlie shakes hands or does something similar, he gets a treat. If he doesn't obey or misbehaves, he won't get any treats. 

In other words, if Charlie performs the desired action, he gets a treat; otherwise, none.

After a few "treat or no treat" iterations, Charlie will recognize the right set of actions to perform to get a treat. When he misbehaves, he learns that such unfavorable actions lead to unfavorable consequences. In the future, when Charlie faces similar situations, he'll know the most desirable action to take to maximize the treat or reward.

“RL means that AI can now be applied to sequential decision-making problems to achieve strategic goals, as opposed to one-off perceptive tasks like image recognition.”

Chris Nicholson
Founder and CEO of Pathmind

Applying the concept of reinforcement learning to this example makes Charlie the agent. The house he lives in becomes his environment, and the treat he receives is his reward. Sitting is a state; so is shaking hands. The transition from sitting to shaking hands can be considered an action.

Your body language and intonation trigger the action (or, in this context, the reaction). The method of selecting the action that, given the current state, leads to the best outcome is the policy.

Whenever Charlie performs the desired action and transitions from one state (sitting) to another (shaking hands), he receives a treat. Since Charlie is a good boy, we don't punish him if he misbehaves. Instead of a penalty or punishment, he simply doesn't get a reward when he fails to perform the desired action, which is the closest thing to a penalty here.

This is closely similar to how an agent learns in reinforcement learning.

Reinforcement learning in gaming

Games and reinforcement learning share a long history. Games are ideal, challenging domains for testing reinforcement learning algorithms.

We've all played computer or video games at some point in our lives. It could have been one of the 8-bit Atari games, a console game like Halo, or a board game like chess.

Regardless of the game you played, it probably took a few attempts to understand the rules before you finally won. In short, it takes time, strategy, and practice to become a pro. And, of course, there's motivation in the form of in-game points or rewards. You get a positive reward when you complete a mission. You score negative points if you fall off a cliff or get arrested because your wanted level is higher than it's supposed to be.

Irrespective of the game's complexity, the above concepts remain universal. If your in-game behavior is in line with the game's instructions, you'll gain points and win. Otherwise, you'll lose points and fail. The same rules apply to reinforcement learning.

Let's take a look at how you can teach a machine to play games.

The human brain can naturally recognize the purpose of a game, but that's difficult for machines. You could apply supervised learning to teach machines, but this requires training data from previous human players. Since human skill eventually plateaus, an agent trained this way could never get "better" than a human.

In reinforcement learning, there's no training dataset or labeled output. The agent can simply compete, fail, and learn from its mistakes based on reward and penalty values. Let's take the game of Pong as an example.

[Image: the game of Pong. Source: ponggame.org]

The purpose of Pong is to bounce the ball with your paddle so that it ends up behind the opponent's paddle. Initially, the agent won't understand this and will fail numerous times. But at some point, it'll make a correct move and be positively reinforced to repeat that action.

After several Pong games, the reinforcement learning agent should have a general understanding of the probability of moving UP successfully versus the probability of moving DOWN. These actions are reinforced until the total reward is maximized. In terms of Pong, this means winning the game without your opponent gaining a single point.

Reinforcement learning and AlphaGo

AlphaGo is the gold standard of advanced reinforcement learning in gaming. Developed by DeepMind, this deep learning computer program became the world's best Go player by defeating Ke Jie, the top-ranked human player at the time.

Here's a quick look at how AlphaGo became the world champion:

  • AlphaGo, like any learning agent, started with zero knowledge of the game.
  • It was then fed the game's basic structure and strategy using thousands of examples from amateur and professional players.
  • It achieved a high skill level in three days, and the testers began playing the program against itself.
  • This led to constant iteration, reinforcement, and pairing with search algorithms.
  • AlphaGo soon became a different, more advanced version of itself – Fan, Lee, Master, and ultimately, Zero.
  • AlphaGo Master competed against the world's top-ranked human player, Ke Jie.

In just 40 days of self-training, AlphaGo Zero outperformed AlphaGo Master and achieved an Elo rating above 5,000, which is essentially superhuman levels.

Markov decision process: Representing RL mathematically

The Markov decision process (MDP) is how reinforcement learning problems are represented mathematically. It's used to formalize RL problems; if the environment is fully observable, it can be modeled as an MDP.

In MDP, the following parameters are used to get a solution for a reinforcement learning problem:

  • Set of possible states - S
  • Set of models
  • Set of possible actions - A
  • Reward - R
  • Policy
  • Value - V

The agent's state can be represented using the Markov state. The Markov state follows the Markov property, which means that the future state depends only on the present state, not on the past.
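
Formally, the Markov property says the probability of the next state depends only on the current state and action, not on the full history:

```latex
P(S_{t+1} \mid S_t, A_t) = P(S_{t+1} \mid S_1, A_1, \ldots, S_t, A_t)
```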

Suppose the RL agent is in a maze environment consisting of four rows and four columns, for a total of 16 blocks. If the agent is on a particular block and two adjacent blocks have the same assigned value (not to be confused with reward), it'll be difficult for the agent to choose between them.

In such situations, the Bellman equation is used. It's a critical constituent of reinforcement learning and helps in solving MDP. Solving means finding the optimal policy and value functions.

Key elements of the Bellman equation are:

  • Action
  • State
  • Reward
  • Discount factor

The Bellman equation is also associated with dynamic programming. It expresses the value of a decision problem at a given point in terms of the immediate reward and the values of the states that follow. With the equation, you can break a complex problem down into simpler, recursive subproblems and find optimal solutions.
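
For reference, the Bellman optimality equation ties the elements above together: the value of a state equals the best achievable immediate reward plus the discounted value of what follows, where γ is the discount factor and P(s' | s, a) is the transition probability:

```latex
V^{*}(s) = \max_{a} \Big( R(s, a) + \gamma \sum_{s'} P(s' \mid s, a)\, V^{*}(s') \Big)
```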

Approaches for implementing reinforcement learning

There are generally three ways to implement a reinforcement learning algorithm: value-based, policy-based, or model-based. These approaches determine how the agent will take action and interact with the environment.

Value-based reinforcement learning

This approach is about finding the optimal value function, which is the maximum value achievable at a state under any policy.

Policy-based reinforcement learning

In this approach, the agent tries to develop a policy so that the action performed in every state would help maximize the future reward.

The policy-based approach can be further divided into two:

  • Deterministic: The policy produces the same action at any given state.
  • Stochastic: The action is sampled from a probability distribution.

Model-based reinforcement learning

In this approach, a virtual model is created for each environment, and the agent explores it to learn. Since the model representation is different for each environment, there isn't a particular RL algorithm or solution for this approach.

Reinforcement learning algorithms

Reinforcement learning algorithms can be classified into two groups: model-free RL algorithms and model-based RL algorithms. Q-learning and deep Q-learning are examples of model-free RL algorithms.

Q-learning

Q-learning is a value-based RL algorithm. It uses temporal difference learning to determine how good an action is in a particular state. Q-learning is an off-policy learner, meaning the agent learns the value function from actions that can come from a different policy than the one it's learning.

What is temporal difference learning?

Temporal difference learning is an approach to predicting a quantity that depends on a particular signal’s future values.

Q-learning starts with the initialization of the Q-table. The agent then selects an action, performs it, measures the reward, and updates the Q-table. A Q-table is a table or matrix created during Q-learning and updated after each action.

In Q-learning, the agent's goal is to maximize the value of Q. The agent strives to find the best action to take in a particular state. The Q stands for quality, indicating the quality of the action taken by the agent.
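
Here's a minimal sketch of one tabular Q-learning step in Python. The env.step interface and the hyperparameter values are assumptions for illustration; the update rule itself is the standard Q-learning rule:

```python
import random
from collections import defaultdict

alpha, gamma, epsilon = 0.1, 0.99, 0.1   # learning rate, discount factor, exploration rate
Q = defaultdict(float)                    # Q-table: (state, action) -> estimated value

def q_learning_step(env, state, actions):
    # Epsilon-greedy action selection.
    if random.random() < epsilon:
        action = random.choice(actions)
    else:
        action = max(actions, key=lambda a: Q[(state, a)])

    next_state, reward, done = env.step(action)   # hypothetical environment API

    # Off-policy target: the best estimated value of the next state,
    # regardless of which action the current policy would actually take.
    best_next = 0.0 if done else max(Q[(next_state, a)] for a in actions)
    Q[(state, action)] += alpha * (reward + gamma * best_next - Q[(state, action)])
    return next_state, done
```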

Monte Carlo method

The Monte Carlo (MC) method is one of the best ways an agent can get the best policy to gain the highest cumulative reward. This method can be used only in episodic tasks, which are tasks that have a definite end.

In the MC method, the agent learns directly from episodes of experience. This also means that the agent initially has no clue about which actions lead to the highest rewards, so actions are chosen randomly. After trying a number of random policies, the agent becomes aware of which ones lead to the highest rewards and gets better at picking policies.
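
A first-visit Monte Carlo sketch: the agent plays a full episode, then averages the observed returns for each state it visited. How episodes are generated is assumed to come from elsewhere; only the averaging step is shown:

```python
from collections import defaultdict

returns_sum = defaultdict(float)
returns_count = defaultdict(int)
V = defaultdict(float)   # estimated state values
gamma = 0.99

def mc_update(episode):
    """episode: list of (state, reward) pairs from one finished episode."""
    G = 0.0
    # Work backwards so G accumulates the discounted return from each step onward.
    for t in reversed(range(len(episode))):
        state, reward = episode[t]
        G = reward + gamma * G
        if state not in [s for s, _ in episode[:t]]:   # first visit to this state
            returns_sum[state] += G
            returns_count[state] += 1
            V[state] = returns_sum[state] / returns_count[state]
```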

SARSA

State-action-reward-state-action (SARSA) is an on-policy temporal difference learning method. This means that it learns the value function based on the current action derived from the currently used policy.

SARSA reflects the fact that the main function used to update the Q-value depends on the agent's current state (S), the action chosen (A), the reward it gets for the action (R), the state the agent enters after performing the action (S), and the action it performs in the new state (A).
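The only difference from Q-learning's update is the target: SARSA uses the value of the action the policy actually takes in the next state rather than the maximum over actions. A sketch of just the update, assuming the same kind of Q-table as before:

```python
def sarsa_update(Q, state, action, reward, next_state, next_action,
                 alpha=0.1, gamma=0.99):
    # On-policy target: the value of the action the current policy takes next.
    target = reward + gamma * Q[(next_state, next_action)]
    Q[(state, action)] += alpha * (target - Q[(state, action)])
```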

Deep Q neural network

A deep Q neural network (DQN) is Q-learning with the help of neural networks. It's ideal when the state and action spaces are large, because defining a Q-table would be complex and time-consuming. Instead of a Q-table, a neural network estimates the Q-values for each action based on the state.
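
A minimal sketch of that idea using PyTorch; the layer sizes are illustrative, and a full DQN would also need experience replay and a target network, which are omitted here:

```python
import torch
import torch.nn as nn

class QNetwork(nn.Module):
    """Maps a state vector to one Q-value per action."""
    def __init__(self, state_dim, num_actions):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim, 64),
            nn.ReLU(),
            nn.Linear(64, num_actions),
        )

    def forward(self, state):
        return self.net(state)

q_net = QNetwork(state_dim=4, num_actions=2)
state = torch.randn(1, 4)                   # a dummy state vector
greedy_action = q_net(state).argmax(dim=1)  # pick the highest-valued action
```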

Reinforcement learning applications

Reinforcement learning is used to teach AI systems to play games, and they keep getting better at it. Apart from that, it's used in finance to evaluate trading strategies and in chemistry to optimize chemical reactions. Companies can also use deep reinforcement learning to teach robots to pick and place goods correctly.

Here are more applications of RL:

  • Business strategy planning
  • Aircraft control and robot motion control
  • Industrial automation
  • Data processing
  • Augmented NLP
  • Recommendation systems
  • Bidding and advertising
  • Traffic light control

Challenges of reinforcement learning

Reinforcement learning is a powerful type of machine learning, but it comes with some challenges.

First, reinforcement learning occurs in a delayed-return environment. The more complex the task at hand, the longer it takes the agent to learn and achieve maximum reward.

For example, an agent might take a few hours to learn the game of Pong, but AlphaGo Zero took 40 days and millions of games to master Go. While that's still an outstanding achievement, it looks like a slow learning curve next to real-world applications like robotics.

Scaling or tweaking the neural networks that control the agent is also a big challenge. There are no means of communicating with the agent other than rewards and penalties. This also means the agent might find a way to maximize the rewards without actually completing the assigned mission.

Reinforcement learning glossary

Artificial intelligence can be a pretty overwhelming topic, especially when you're learning new terms. Here's a review of some of the terms used in reinforcement learning and what they mean.

  • Agent: The AI system that undergoes the learning process. Also called the learner or decision-maker. The algorithm is the agent. 
  • Action: The set of all possible moves an agent can make.
  • Environment: The world through which the agent moves and receives feedback. The environment takes the agent's current state and action as input and then outputs the reward and the next state.
  • State: An immediate situation in which the agent finds itself. It can be a specific moment or position in the environment, either current or a potential future one. In simple words, it's the agent's status in the environment.
  • Reward: For every action made, the agent receives a reward from the environment. A reward could be positive or negative, depending on the action.
  • Policy: The strategy the agent uses to determine the next action based on the current state. In other words, it maps states to actions so that the agent can choose the action with the highest reward.
  • Model: The agent's view of the environment. It maps the state-action pairs to the probability distributions over states. However, not every RL agent uses a model of its environment.
  • Value function: In simple terms, the value function represents how favorable a state is for the agent. The state's value is the long-term reward the agent can expect to receive starting from that state and following a specific policy.
  • Discount factor: Discount factor (γ) determines how much the agent cares about rewards in the distant future when compared to those in the immediate future. It's a value between zero and one. If the discount factor equals 0, the agent will only learn about actions that produce immediate rewards. If it's equal to 1, the agent will evaluate its actions based on the sum of its future rewards.
  • Dynamic programming (DP): An algorithmic technique used to solve an optimization problem by breaking it down into subproblems. It follows the concept that the optimal solution to the overall problem depends on the optimal solution to its subproblems.

If these terms overwhelm you, think about what reinforcement learning would be in real life. The agent is you, and the environment is your surroundings and the laws of physics like gravity.

If you're learning to walk, the state could be the position of your legs. If you take the best action, you get a reward, which is walking a few steps. Otherwise, you get a penalty, which in this case means you fall and hurt yourself.

It’s game time for the robots

Humans love rewards. Gamification is the easiest way to tempt us into completing a task without feeling demotivated. It's why playing a sport seems more fun than working out at a gym.

Reinforcement learning lures AI agents into making the right decisions in exchange for rewards. We're yet to hear what the robots think about gamification, but we hope they like it.

Some say it's the last invention we'll ever need. Some feel it's an unattainable goal. It's called artificial general intelligence and, in effect, would be our greatest invention or the biggest threat ever.

Amal Joby

Amal is a Research Analyst at G2 researching the cybersecurity, blockchain, and machine learning space. He's fascinated by the human mind and hopes to decipher it in its entirety one day. In his free time, you can find him reading books, obsessing over sci-fi movies, or fighting the urge to have a slice of pizza.