Expected SARSA Agent

class rlforge.agents.tabular.expected_sarsa.ExpectedSarsaAgent(step_size, discount, num_states, num_actions, epsilon=0.1, planning=False, planning_steps=0, exploration_bonus=0)

Tabular agent implementing the Expected SARSA algorithm.

This agent extends PlanningAgent and defines the Q-value update rule using the Expected SARSA method. Unlike standard SARSA, whose update target uses the Q-value of the next action actually sampled, Expected SARSA uses the expected value of the next state’s Q-values under the current policy. This removes the variance introduced by sampling the next action, which often makes learning more stable.

Notes

The agent uses an epsilon-greedy policy for action selection.
Planning steps can be enabled via the base class to simulate experience from the learned model.

end(reward)

Complete an episode.

Performs a final update to the Q-value of the last state-action pair.

Parameters

reward (float): The terminal reward received at the end of the episode.
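
Because a terminal transition has no successor state, one common form of this final update uses the reward alone as the target. The sketch below assumes that form; the function name `terminal_update` and its arguments are illustrative and not part of the documented API.

```python
import numpy as np

def terminal_update(q, prev_state, prev_action, reward, step_size):
    """Move Q(s, a) toward the terminal reward (no bootstrap term)."""
    td_error = reward - q[prev_state, prev_action]
    q[prev_state, prev_action] += step_size * td_error
    return q

# Hypothetical 3-state, 2-action table, initialized to zeros.
q = np.zeros((3, 2))
terminal_update(q, prev_state=2, prev_action=1, reward=1.0, step_size=0.5)
```

Only the entry for the last state-action pair changes; every other entry keeps its previous value.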

planning_step()

Perform planning updates using the learned model.

Randomly samples stored transitions from the model and applies Q-value updates. An exploration bonus can be added to encourage revisiting less frequently updated state-action pairs.
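
The planning loop described above might look roughly like the following sketch. It assumes the model is a dict mapping each state to `{action: (next_state, reward)}` and that the bonus takes the Dyna-Q+ form `exploration_bonus * sqrt(tau)`; neither detail is confirmed by this page, and the helper `expected_value` is hypothetical.

```python
import numpy as np

def expected_value(q_row, epsilon):
    """Expected Q-value of one state under an epsilon-greedy policy."""
    n = q_row.size
    probs = np.full(n, epsilon / n)
    greedy = np.flatnonzero(q_row == q_row.max())
    probs[greedy] += (1.0 - epsilon) / greedy.size
    return float(probs @ q_row)

def planning_step(q, model, tau, rng, step_size, discount, epsilon,
                  planning_steps, exploration_bonus):
    """Apply `planning_steps` simulated updates from stored transitions."""
    states = list(model)
    for _ in range(planning_steps):
        # Sample a previously seen state and one of its stored actions.
        state = states[rng.integers(len(states))]
        actions = list(model[state])
        action = actions[rng.integers(len(actions))]
        next_state, reward = model[state][action]
        # Assumed bonus form (Dyna-Q+ style): grows with time since last update.
        reward = reward + exploration_bonus * np.sqrt(tau[state, action])
        target = reward + discount * expected_value(q[next_state], epsilon)
        q[state, action] += step_size * (target - q[state, action])
    return q

rng = np.random.default_rng(0)
q = np.zeros((2, 2))
tau = np.zeros((2, 2))
model = {0: {0: (1, 1.0), 1: (1, 0.5)}}  # toy model with one visited state
planning_step(q, model, tau, rng, step_size=0.5, discount=0.9,
              epsilon=0.1, planning_steps=20, exploration_bonus=0.0)
```

In this toy run only state 0 appears in the model, so the row for state 1 is never modified.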

reset()

Reset the agent’s internal state.

Initializes the Q-table to zeros. If planning is enabled, also initializes the tau matrix, which tracks the time since each state-action pair was last updated.

select_action(q_values)

Select an action using epsilon-greedy exploration.

Parameters

q_values (numpy.ndarray): Array of Q-values for the current state.

Returns

action (int): The chosen action.
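
A minimal sketch of epsilon-greedy selection follows. Passing `epsilon` and a NumPy random generator as arguments, and breaking ties between equal maxima at random, are assumptions made for illustration; the documented method takes only `q_values`.

```python
import numpy as np

def select_action(q_values, epsilon, rng):
    """Pick a random action with probability epsilon, else a greedy one."""
    if rng.random() < epsilon:
        return int(rng.integers(q_values.size))
    # Assumed tie-breaking: choose uniformly among the maximal actions.
    best = np.flatnonzero(q_values == q_values.max())
    return int(rng.choice(best))

rng = np.random.default_rng(0)
action = select_action(np.array([0.1, 0.7, 0.7]), epsilon=0.1, rng=rng)
```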

start(new_state)

Begin a new episode.

Parameters

new_state (int): The initial state observed from the environment.

Returns

action (int): The first action selected by the agent.

step(reward, new_state)

Take a step in the environment.

Updates Q-values based on the transition and, if planning is enabled, performs additional simulated updates using the learned model.

Parameters

reward (float): Reward received from the previous action.
new_state (int): The new state observed from the environment.

Returns

action (int): The next action chosen by the agent.
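
The start, step, and end methods are meant to be driven by an environment loop. The sketch below shows one way that loop could be arranged; `RandomAgent` is a hypothetical stand-in exposing the same three methods, and the corridor environment is invented for illustration.

```python
import random

class RandomAgent:
    """Hypothetical stand-in with the same start/step/end interface."""

    def __init__(self, num_actions, seed=0):
        self.num_actions = num_actions
        self.rng = random.Random(seed)
        self.updates = 0  # counts calls that would trigger a Q-value update

    def start(self, new_state):
        return self.rng.randrange(self.num_actions)

    def step(self, reward, new_state):
        self.updates += 1
        return self.rng.randrange(self.num_actions)

    def end(self, reward):
        self.updates += 1

def run_episode(agent, num_states=4, max_steps=1000):
    """Toy corridor: action 1 moves right, action 0 moves left."""
    state = 0
    action = agent.start(state)
    for t in range(1, max_steps + 1):
        state = max(0, state + (1 if action == 1 else -1))
        if state == num_states - 1:
            agent.end(1.0)  # terminal transition: reward only, no next state
            return t
        action = agent.step(0.0, state)
    return max_steps

agent = RandomAgent(num_actions=2)
steps = run_episode(agent)
```

Each environment transition corresponds to exactly one call to either `step` or `end`, so the number of updates equals the number of transitions.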

update_model(prev_state, prev_action, reward, new_state)

Update the agent’s model of the environment.

Stores the transition (state, action → next state, reward) in the model. For unseen states, initializes all actions with default transitions.

Parameters

prev_state (int): The previous state.
prev_action (int): The action taken in the previous state.
reward (float): The reward received.
new_state (int): The new state observed.
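
One plausible shape for such a model is a nested dict, sketched below. The choice of default transition for untried actions (a zero-reward self-loop, as in Dyna-Q+) is an assumption, as are the standalone function form and the `num_actions` argument.

```python
def update_model(model, prev_state, prev_action, reward, new_state, num_actions):
    """Record (state, action) -> (next state, reward) in a dict-based model."""
    if prev_state not in model:
        # Assumed default: untried actions loop back to the same state
        # with zero reward until a real transition overwrites them.
        model[prev_state] = {a: (prev_state, 0.0) for a in range(num_actions)}
    model[prev_state][prev_action] = (new_state, reward)
    return model

model = {}
update_model(model, prev_state=0, prev_action=1, reward=1.0,
             new_state=2, num_actions=3)
```

After this call, `model[0][1]` holds the observed transition, while actions 0 and 2 in state 0 hold the assumed defaults.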

update_q_values(prev_state, prev_action, reward, new_state)

Update Q-values using the Expected SARSA update rule.

The update is based on the expected value of the next state’s Q-values, weighted by the probabilities of selecting each action under the epsilon-greedy policy.

Parameters

prev_state (int): The previous state index.
prev_action (int): The action taken in the previous state.
reward (float): The reward received after taking the action.
new_state (int): The new state index observed after the transition.

Notes

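The probability distribution pi is constructed such that:
- Each non-greedy action has probability epsilon / num_actions.
- Greedy actions (those with maximum Q-value) share the remaining probability mass equally: each receives (1 - epsilon) / k on top of the same epsilon / num_actions base, where k is the number of greedy actions, so that the probabilities sum to 1.

The Q-value update follows: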

\[Q(s, a) \leftarrow Q(s, a) + \alpha \Big[ r + \gamma \sum_{a'} \pi(a' \mid s') Q(s', a') - Q(s, a) \Big]\]

where \(\alpha\) is the step size, \(\gamma\) is the discount factor, and \(\pi(a' \mid s')\) is the epsilon-greedy policy.
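
The update rule can be sketched in NumPy as follows. This is an illustrative standalone version, not the library's implementation: the function names and explicit `step_size`, `discount`, and `epsilon` arguments are assumptions, and the policy gives every action epsilon / num_actions with the greedy actions splitting the remaining 1 - epsilon.

```python
import numpy as np

def epsilon_greedy_probs(q_row, epsilon):
    """Action probabilities of an epsilon-greedy policy for one state."""
    n = q_row.size
    probs = np.full(n, epsilon / n)
    greedy = np.flatnonzero(q_row == q_row.max())
    probs[greedy] += (1.0 - epsilon) / greedy.size
    return probs

def update_q_values(q, prev_state, prev_action, reward, new_state,
                    step_size, discount, epsilon):
    """Apply one Expected SARSA update to the table q in place."""
    pi = epsilon_greedy_probs(q[new_state], epsilon)
    expected_q = pi @ q[new_state]  # sum over a' of pi(a'|s') * Q(s', a')
    td_error = reward + discount * expected_q - q[prev_state, prev_action]
    q[prev_state, prev_action] += step_size * td_error
    return q

q = np.array([[0.0, 0.0],
              [1.0, 3.0]])
update_q_values(q, prev_state=0, prev_action=0, reward=1.0, new_state=1,
                step_size=0.5, discount=0.9, epsilon=0.2)
```

In this example the policy in state 1 is [0.1, 0.9], the expected next value is 2.8, the target is 1 + 0.9 × 2.8 = 3.52, and Q(0, 0) moves halfway from 0 to that target, giving approximately 1.76.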