SARSA (State-Action-Reward-State-Action) in Reinforcement Learning
Last Updated: 17 Jul, 2025
SARSA (State-Action-Reward-State-Action) is an on-policy reinforcement learning (RL) algorithm that helps an agent to learn an optimal policy by interacting with its environment. The agent explores its environment, takes actions, receives feedback and continuously updates its behavior to maximize long-term rewards.
Unlike off-policy algorithms such as Q-learning, which learn from the best possible action in the next state, SARSA updates its knowledge based on the actions the agent actually takes. This makes it suitable for environments where the agent's own actions and the feedback they produce directly shape what is learned.
Key Components of the SARSA Algorithm
Key components of the SARSA Algorithm are as follows:
- State (S): The current situation or position in the environment.
- Action (A): The decision or move the agent makes in a given state.
- Reward (R): The immediate feedback or outcome the agent receives after taking an action.
- Next State (S'): The state the agent transitions to after taking an action.
- Next Action (A'): The action the agent will take in the next state based on its current policy.
SARSA focuses on updating the agent's Q-values (a measure of the quality of a given state-action pair) based on both the immediate reward and the expected future rewards.
How Does SARSA Update Q-values?
The main idea of SARSA is to update the Q-value for each state-action pair based on the actual experience. The Q-value represents the expected cumulative reward the agent can achieve starting from a given state and action.
SARSA updates the Q-value using the Bellman Equation for SARSA:
Q(s_t, a_t) \leftarrow Q(s_t, a_t) + \alpha \left[ r_{t+1} + \gamma Q(s_{t+1}, a_{t+1}) - Q(s_t, a_t) \right]
Where:
- Q(s_t, a_t) is the current Q-value for the state-action pair at time step t.
- α is the learning rate (a value between 0 and 1) which determines how much the Q-values are updated.
- r_{t+1} is the immediate reward the agent receives after taking action a_t in state s_t.
- γ is the discount factor (between 0 and 1) which determines how much future rewards are valued relative to immediate ones.
- Q(s_{t+1}, a_{t+1}) is the Q-value for the next state-action pair.
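For comparison, Q-learning (the off-policy counterpart mentioned earlier) builds its target from the best action in the next state rather than the action actually chosen:
Q(s_t, a_t) \leftarrow Q(s_t, a_t) + \alpha \left[ r_{t+1} + \gamma \max_{a} Q(s_{t+1}, a) - Q(s_t, a_t) \right]
SARSA instead bootstraps from Q(s_{t+1}, a_{t+1}), the value of the action the agent will actually execute, which is what makes it on-policy.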
Breaking Down the Update Rule
- Immediate Reward: The agent receives an immediate reward r_{t+1} after taking action a_t in state s_t.
- Future Reward: The expected future reward is calculated as Q(s_{t+1}, a_{t+1}), the Q-value of the next state-action pair.
- Correction: The agent updates the Q-value for the current state-action pair by the difference between the target r_{t+1} + γQ(s_{t+1}, a_{t+1}) and its current estimate Q(s_t, a_t), known as the temporal-difference (TD) error.
This update rule allows the agent to adjust its policy incrementally, improving decision-making over time.
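To make the rule concrete, here is a minimal sketch of a single SARSA update with made-up numbers (the values below are purely illustrative, not taken from any particular environment): the current estimate moves a fraction α of the way toward the target r_{t+1} + γQ(s_{t+1}, a_{t+1}).
Python
# One SARSA update with illustrative numbers
alpha, gamma = 0.1, 0.9

q_sa = 2.0       # current estimate Q(s_t, a_t)
reward = 1.0     # r_{t+1} observed after taking a_t in s_t
q_next = 3.0     # Q(s_{t+1}, a_{t+1}) for the action actually chosen next

target = reward + gamma * q_next        # 1.0 + 0.9 * 3.0 = 3.7
q_sa = q_sa + alpha * (target - q_sa)   # 2.0 + 0.1 * 1.7 = 2.17
print(q_sa)  # approximately 2.17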
SARSA Algorithm Steps
Let's see how the SARSA algorithm works step-by-step:
1. Initialize Q-values: Begin by setting arbitrary values for the Q-table (for each state-action pair).
2. Choose Initial State: Start the agent in an initial state s_0.
3. Episode Loop: For each episode (a complete run through the environment), we set the initial state s_t and choose an action a_t based on a policy such as ε-greedy.
4. Step Loop: For each step in the episode:
- Take action a_t, observe the reward r_{t+1} and transition to the next state s_{t+1}.
- Choose the next action a_{t+1} based on the policy for state s_{t+1}.
- Update the Q-value for the state-action pair (s_t, a_t) using the SARSA update rule.
- Set s_t = s_{t+1} and a_t = a_{t+1}.
5. End Condition: Repeat until the episode ends either because the agent reaches a terminal state or after a fixed number of steps.
Implementing SARSA in Grid World using Python
Let’s consider a practical example of implementing SARSA in a Grid World environment where the agent can move up, down, left or right to reach a goal.
Step 1: Defining the Environment (GridWorld)
- Start Position: Initial position of the agent.
- Goal Position: Target the agent aims to reach.
- Obstacles: Locations the agent should avoid with negative rewards.
- Rewards: Positive rewards for reaching the goal, negative rewards for hitting obstacles.
The GridWorld environment simulates the agent's movement, applying the dynamics of state transitions and rewards.
Here we will be using the NumPy library (together with Python's built-in random module) for the implementation.
Python
import numpy as np
import random


class GridWorld:
    def __init__(self, width, height, start, goal, obstacles):
        self.width = width
        self.height = height
        self.start = start
        self.goal = goal
        self.obstacles = obstacles
        self.state = start

    def reset(self):
        # Return the agent to the start position at the beginning of an episode
        self.state = self.start
        return self.state

    def step(self, action):
        # Actions: 0 = up, 1 = down, 2 = left, 3 = right
        x, y = self.state
        if action == 0:
            x = max(x - 1, 0)
        elif action == 1:
            x = min(x + 1, self.height - 1)
        elif action == 2:
            y = max(y - 1, 0)
        elif action == 3:
            y = min(y + 1, self.width - 1)
        next_state = (x, y)

        # Reward structure: -10 for obstacles, +10 for the goal, -1 per step otherwise
        if next_state in self.obstacles:
            reward = -10
            done = True
        elif next_state == self.goal:
            reward = 10
            done = True
        else:
            reward = -1
            done = False

        self.state = next_state
        return next_state, reward, done
Step 2: Defining the SARSA Algorithm
The agent uses the SARSA algorithm to update its Q-values based on its interactions with the environment, adjusting its behavior over time to reach the goal.
Python
def sarsa(env, episodes, alpha, gamma, epsilon):
    # Q-table indexed by (row, column, action); 4 possible actions per cell
    Q = np.zeros((env.height, env.width, 4))
    for episode in range(episodes):
        state = env.reset()
        action = epsilon_greedy_policy(Q, state, epsilon)
        done = False
        while not done:
            next_state, reward, done = env.step(action)
            next_action = epsilon_greedy_policy(Q, next_state, epsilon)
            # SARSA update: the target uses the action actually chosen in the next state
            Q[state[0], state[1], action] += alpha * (
                reward + gamma * Q[next_state[0], next_state[1], next_action]
                - Q[state[0], state[1], action]
            )
            state = next_state
            action = next_action
    return Q
Step 3: Defining the Epsilon-Greedy Policy
The epsilon-greedy policy balances exploration and exploitation:
- With probability ϵ, the agent chooses a random action (exploration).
- With probability 1−ϵ, it chooses the action with the highest Q-value for the current state (exploitation).
Python
def epsilon_greedy_policy(Q, state, epsilon):
    # Explore with probability epsilon, otherwise act greedily with respect to Q
    if random.uniform(0, 1) < epsilon:
        return random.randint(0, 3)
    else:
        return np.argmax(Q[state[0], state[1]])
Step 4: Setting Up the Environment and Running SARSA
This step involves:
- Defining the grid world parameters like width, height, start, goal, obstacles.
- Setting the SARSA hyperparameters like episodes, learning rate, discount factor, exploration rate.
- Running the SARSA algorithm and printing the learned Q-values.
Python
if __name__ == "__main__":
    # Grid world parameters
    width = 5
    height = 5
    start = (0, 0)
    goal = (4, 4)
    obstacles = [(2, 2), (3, 2)]
    env = GridWorld(width, height, start, goal, obstacles)

    # SARSA hyperparameters
    episodes = 1000
    alpha = 0.1      # learning rate
    gamma = 0.99     # discount factor
    epsilon = 0.1    # exploration rate

    Q = sarsa(env, episodes, alpha, gamma, epsilon)
    print("Learned Q-values:")
    print(Q)
Output:
After running the SARSA algorithm, the printed Q-values represent the expected cumulative reward for each state-action pair. The agent uses these Q-values to make decisions in the environment; higher Q-values indicate better actions for a given state.
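As an illustration of how the learned table can be used, the sketch below (not part of the program above) derives a greedy policy by taking the arg-max over the action dimension of Q for every cell; the letter symbols for the four actions are just a hypothetical way to display the chosen moves.
Python
# A minimal sketch for inspecting the learned policy, assuming the Q-table
# returned by sarsa() above (shape: height x width x 4)
action_symbols = ['U', 'D', 'L', 'R']  # hypothetical labels for actions 0-3

def greedy_policy_grid(Q):
    # Pick the highest-valued action in every cell of the grid
    best_actions = np.argmax(Q, axis=2)
    return [[action_symbols[a] for a in row] for row in best_actions]

for row in greedy_policy_grid(Q):
    print(' '.join(row))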
Exploration Strategies in SARSA
SARSA uses an exploration-exploitation strategy to choose actions. A common strategy is ε-greedy:
- Exploration: With probability ε, the agent chooses a random action (exploring new possibilities).
- Exploitation: With probability 1−ε, the agent chooses the action with the highest Q-value for the current state (exploiting its current knowledge).
Over time, ε is often decayed to shift from exploration to exploitation as the agent gains more experience in the environment.
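A common way to implement this decay (a sketch under assumed values, not part of the implementation above) is to multiply ε by a fixed factor after every episode, with a floor so the agent never stops exploring entirely; the decay rate and minimum below are illustrative choices.
Python
# Illustrative epsilon-decay schedule (hypothetical values)
epsilon = 1.0        # start fully exploratory
epsilon_min = 0.01   # never drop below this much exploration
epsilon_decay = 0.995

for episode in range(1000):
    # ... run one SARSA episode using the current epsilon ...
    epsilon = max(epsilon_min, epsilon * epsilon_decay)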
Advantages
- On-Policy Learning: It updates Q-values based on the agent’s actual actions which makes it realistic for environments where exploration and behavior directly influence learning.
- Real-World Behavior: The agent learns from real experiences, leading to grounded decision-making that reflects its actual behavior in uncertain situations.
- Gradual Improvement: It is more stable than off-policy methods like Q-learning when exploration is needed to discover optimal actions.
Limitations
- Slower Convergence: It tends to converge more slowly than off-policy methods like Q-learning in environments that require heavy exploration.
- Sensitive to Exploration Strategy: Its performance is highly dependent on the exploration strategy used and improper management can delay or hinder learning.
By mastering SARSA, we can build more adaptive agents capable of making grounded decisions in uncertain environments.