Expected SARSA Agent
- class rlforge.agents.tabular.expected_sarsa.ExpectedSarsaAgent(step_size, discount, num_states, num_actions, epsilon=0.1, planning=False, planning_steps=0, exploration_bonus=0)
Tabular agent implementing the Expected SARSA algorithm.
This agent extends PlanningAgent and defines the Q-value update rule using the Expected SARSA method. Unlike standard SARSA, which bootstraps from the next action actually selected, Expected SARSA bootstraps from the expected value of the next state’s Q-values under the current policy. This removes the variance introduced by sampling the next action and often improves stability.
Notes
The agent uses an epsilon-greedy policy for action selection.
Planning steps can be enabled via the base class to simulate experience from the learned model.
- end(reward)
Complete an episode.
Performs a final update to the Q-value of the last state-action pair.
Parameters
- reward : float
The terminal reward received at the end of the episode.
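The final update can be sketched as follows. This is a minimal illustration that assumes the terminal state has value zero, so the target reduces to the terminal reward alone; the function name `terminal_update` and the array layout `q[state, action]` are hypothetical and not taken from the library.

```python
import numpy as np

def terminal_update(q, last_state, last_action, reward, step_size):
    """Move Q(last_state, last_action) toward the terminal reward.

    Assumption: there is no bootstrap term because the episode has ended.
    """
    td_error = reward - q[last_state, last_action]
    q[last_state, last_action] += step_size * td_error
    return q

q = np.zeros((3, 2))
q[2, 1] = 0.5
terminal_update(q, last_state=2, last_action=1, reward=1.0, step_size=0.1)
```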
- planning_step()
Perform planning updates using the learned model.
Randomly samples stored transitions from the model and applies Q-value updates. An exploration bonus can be added to encourage revisiting less frequently updated state-action pairs.
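One way such a planning loop could look is sketched below. Several details are assumptions rather than documented behavior: the model is a nested dict `model[state][action] = (next_state, reward)`, a `next_state` of `None` marks a terminal transition, and the bonus takes the Dyna-Q+ form `exploration_bonus * sqrt(tau)`.

```python
import numpy as np

def planning_step(q, model, tau, planning_steps, exploration_bonus,
                  step_size, discount, epsilon, rng):
    """Apply simulated Expected SARSA updates from stored transitions."""
    # Flatten the model into a list of (state, action) pairs to sample from.
    seen = [(s, a) for s, actions in model.items() for a in actions]
    for _ in range(planning_steps):
        s, a = seen[rng.integers(len(seen))]
        next_state, reward = model[s][a]
        # Assumed Dyna-Q+ style bonus: larger for pairs not updated recently.
        reward = reward + exploration_bonus * np.sqrt(tau[s, a])
        if next_state is None:
            expected_q = 0.0  # terminal transition: nothing to bootstrap from
        else:
            q_next = q[next_state]
            pi = np.full(len(q_next), epsilon / len(q_next))
            greedy = np.flatnonzero(q_next == q_next.max())
            pi[greedy] += (1.0 - epsilon) / len(greedy)
            expected_q = float(pi @ q_next)
        q[s, a] += step_size * (reward + discount * expected_q - q[s, a])
    return q

rng = np.random.default_rng(0)
model = {0: {0: (1, 1.0)}}   # a single stored transition
q = np.zeros((2, 2))
tau = np.zeros((2, 2))
tau[0, 0] = 4.0              # four time steps since (0, 0) was last updated
planning_step(q, model, tau, planning_steps=1, exploration_bonus=0.1,
              step_size=0.5, discount=0.9, epsilon=0.1, rng=rng)
```

With `exploration_bonus=0`, the loop reduces to plain model-based replay of stored transitions.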
- reset()
Reset the agent’s internal state.
Initializes the Q-table to zeros. If planning is enabled, also initializes the tau matrix, which tracks the time since each state-action pair was last updated.
- select_action(q_values)
Select an action using epsilon-greedy exploration.
Parameters
- q_values : numpy.ndarray
Array of Q-values for the current state.
Returns
- action : int
The chosen action.
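Epsilon-greedy selection can be sketched as below. The explicit `epsilon` and `rng` arguments and the random tie-breaking among greedy actions are assumptions made for illustration; the documented method takes only `q_values`.

```python
import numpy as np

def select_action(q_values, epsilon, rng):
    """Return a random action with probability epsilon, else a greedy one."""
    if rng.random() < epsilon:
        # Explore: every action is equally likely.
        return int(rng.integers(len(q_values)))
    # Exploit: break ties among maximal Q-values uniformly at random.
    greedy = np.flatnonzero(q_values == np.max(q_values))
    return int(rng.choice(greedy))

rng = np.random.default_rng(0)
q_values = np.array([0.0, 2.0, 2.0, 1.0])
greedy_only = [select_action(q_values, 0.0, rng) for _ in range(200)]
explore_only = [select_action(q_values, 1.0, rng) for _ in range(200)]
```

Here `greedy_only` contains only actions 1 and 2 (the tied maxima), while `explore_only` can contain any of the four actions.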
- start(new_state)
Begin a new episode.
Parameters
- new_state : int
The initial state observed from the environment.
Returns
- action : int
The first action selected by the agent.
- step(reward, new_state)
Take a step in the environment.
Updates Q-values based on the transition and, if planning is enabled, performs additional simulated updates using the learned model.
Parameters
- reward : float
Reward received from the previous action.
- new_state : int
The new state observed from the environment.
Returns
- action : int
The next action chosen by the agent.
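The start, step, and end methods are meant to be driven by an episode loop. The sketch below shows one possible loop; `CorridorEnv`, `AlwaysRightAgent`, and `run_episode` are hypothetical stand-ins written for this example, and only the three method signatures are taken from this page.

```python
class CorridorEnv:
    """Hypothetical 1-D corridor: start at state 0, terminal state at `length`."""

    def __init__(self, length=3):
        self.length = length
        self.state = 0

    def reset(self):
        self.state = 0
        return self.state

    def step(self, action):
        # Action 1 moves right; any other action moves left (floored at 0).
        self.state = self.state + 1 if action == 1 else max(0, self.state - 1)
        done = self.state == self.length
        return self.state, (1.0 if done else 0.0), done


class AlwaysRightAgent:
    """Stand-in agent exposing the documented start/step/end interface."""

    def __init__(self):
        self.calls = []

    def start(self, new_state):
        self.calls.append("start")
        return 1

    def step(self, reward, new_state):
        self.calls.append("step")
        return 1

    def end(self, reward):
        self.calls.append("end")


def run_episode(agent, env, max_steps=1000):
    """Run one episode and return the total reward collected."""
    action = agent.start(env.reset())
    total = 0.0
    for _ in range(max_steps):
        state, reward, done = env.step(action)
        total += reward
        if done:
            agent.end(reward)  # terminal transition: no next action needed
            break
        action = agent.step(reward, state)
    return total


agent = AlwaysRightAgent()
total_reward = run_episode(agent, CorridorEnv(length=3))
```

An agent with the same three methods could be passed to `run_episode` in place of the stand-in.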
- update_model(prev_state, prev_action, reward, new_state)
Update the agent’s model of the environment.
Stores the transition (state, action → next state, reward) in the model. For unseen states, initializes all actions with default transitions.
Parameters
- prev_state : int
The previous state.
- prev_action : int
The action taken in the previous state.
- reward : float
The reward received.
- new_state : int
The new state observed.
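A model of this kind could be kept in a nested dictionary, as in the sketch below. The page does not specify what a "default transition" is; this example assumes the Dyna-Q+ convention that an untried action leads back to the same state with zero reward. The extra `model` and `num_actions` arguments are also illustrative.

```python
def update_model(model, prev_state, prev_action, reward, new_state, num_actions):
    """Record a transition in model[state][action] = (next_state, reward)."""
    if prev_state not in model:
        # Assumed default: untried actions loop back to the same state, reward 0.
        model[prev_state] = {a: (prev_state, 0.0) for a in range(num_actions)}
    # Overwrite the entry for the action that was actually taken.
    model[prev_state][prev_action] = (new_state, reward)
    return model

model = {}
update_model(model, prev_state=3, prev_action=1, reward=1.0, new_state=4,
             num_actions=2)
```

Initializing every action for a newly seen state lets planning sample actions that were never tried there, which is what makes an exploration bonus effective.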
- update_q_values(prev_state, prev_action, reward, new_state)
Update Q-values using the Expected SARSA update rule.
The update is based on the expected value of the next state’s Q-values, weighted by the probabilities of selecting each action under the epsilon-greedy policy.
Parameters
- prev_state : int
The previous state index.
- prev_action : int
The action taken in the previous state.
- reward : float
The reward received after taking the action.
- new_state : int
The new state index observed after the transition.
Notes
- The probability distribution pi is constructed such that:
  - Each non-greedy action has probability epsilon / num_actions.
  - Greedy actions (those with maximum Q-value) share the remaining probability mass, that is, (1 - epsilon) plus their own epsilon / num_actions shares, so that pi sums to 1.
- The Q-value update follows:
\[Q(s, a) \leftarrow Q(s, a) + \alpha \Big[ r + \gamma \sum_{a'} \pi(a' \mid s') Q(s', a') - Q(s, a) \Big]\]
where \(\alpha\) is the step size, \(\gamma\) is the discount factor, and \(\pi(a' \mid s')\) is the epsilon-greedy policy.
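The update rule can be written in a few lines of NumPy. The sketch below is an illustration, not the library's implementation: the function names and the `q[state, action]` array layout are assumptions, and the policy uses the common convention in which every action receives epsilon / num_actions and the greedy actions split an extra (1 - epsilon) equally.

```python
import numpy as np

def epsilon_greedy_probs(q_values, epsilon):
    """Action probabilities of an epsilon-greedy policy over q_values."""
    num_actions = len(q_values)
    # Every action receives the exploration share epsilon / num_actions.
    pi = np.full(num_actions, epsilon / num_actions)
    # Greedy actions (ties included) split the remaining 1 - epsilon equally.
    greedy = np.flatnonzero(q_values == np.max(q_values))
    pi[greedy] += (1.0 - epsilon) / len(greedy)
    return pi

def expected_sarsa_update(q, prev_state, prev_action, reward, new_state,
                          step_size, discount, epsilon):
    """Apply one Expected SARSA update to the Q-table in place."""
    pi = epsilon_greedy_probs(q[new_state], epsilon)
    expected_q = np.dot(pi, q[new_state])      # sum_a' pi(a'|s') Q(s', a')
    td_target = reward + discount * expected_q
    q[prev_state, prev_action] += step_size * (td_target - q[prev_state, prev_action])
    return q

q = np.zeros((2, 2))
q[1] = [1.0, 3.0]
pi = epsilon_greedy_probs(q[1], epsilon=0.1)   # [0.05, 0.95]
expected_sarsa_update(q, prev_state=0, prev_action=0, reward=1.0, new_state=1,
                      step_size=0.5, discount=0.9, epsilon=0.1)
```

In this example the expected next-state value is 0.05 * 1.0 + 0.95 * 3.0 = 2.9, the target is 1.0 + 0.9 * 2.9 = 3.61, and Q(0, 0) moves halfway from 0 to 3.61, giving 1.805.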