SARSA Agent

class rlforge.agents.tabular.sarsa.SarsaAgent(step_size, discount, num_states, num_actions, epsilon=0.1)

Tabular agent implementing the SARSA algorithm.

SARSA (State-Action-Reward-State-Action) is an on-policy temporal difference learning method. Unlike Q-learning, which updates toward the maximum action value in the next state, SARSA updates toward the value of the action actually taken under the current policy. This makes SARSA sensitive to the agent’s exploration strategy.
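The difference between the two bootstrapping targets can be illustrated with a small sketch. The function names and the NumPy array layout `q[state, action]` are assumptions made for illustration; they are not taken from rlforge.

```python
import numpy as np

def sarsa_target(q, reward, next_state, next_action, discount):
    # On-policy: bootstrap from the action the agent will actually take next.
    return reward + discount * q[next_state, next_action]

def q_learning_target(q, reward, next_state, discount):
    # Off-policy: bootstrap from the best action in the next state.
    return reward + discount * np.max(q[next_state])

# Hypothetical 2-state, 2-action table; state 1 has values [1.0, 5.0].
q = np.zeros((2, 2))
q[1] = [1.0, 5.0]

# Suppose exploration picked the worse action (0) in state 1.
s_target = sarsa_target(q, reward=0.0, next_state=1, next_action=0, discount=0.9)
q_target = q_learning_target(q, reward=0.0, next_state=1, discount=0.9)
```

When the exploratory action is not the greedy one, the SARSA target is lower than the Q-learning target, which is how the exploration strategy ends up reflected in the learned values.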

Notes

The agent uses an epsilon-greedy policy for action selection.
This implementation does not include planning steps; it directly inherits from BaseAgent.
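The methods documented on this page suggest a start / step / end calling pattern. The sketch below illustrates that pattern with a hypothetical stand-in class, `ToySarsaAgent`, and a toy corridor environment. Neither is rlforge code: the internals, the `seed` argument, and the random tie-breaking are assumptions, and the real SarsaAgent may differ.

```python
import numpy as np

class ToySarsaAgent:
    """Hypothetical stand-in that mirrors the documented method names."""

    def __init__(self, step_size, discount, num_states, num_actions, epsilon=0.1, seed=0):
        self.step_size = step_size
        self.discount = discount
        self.epsilon = epsilon
        self.num_actions = num_actions
        self.rng = np.random.default_rng(seed)
        self.q = np.zeros((num_states, num_actions))

    def select_action(self, q_values):
        if self.rng.random() < self.epsilon:
            return int(self.rng.integers(self.num_actions))
        best = np.flatnonzero(q_values == q_values.max())
        return int(self.rng.choice(best))  # break ties at random

    def start(self, new_state):
        self.state = new_state
        self.action = self.select_action(self.q[new_state])
        return self.action

    def step(self, reward, new_state):
        next_action = self.select_action(self.q[new_state])
        target = reward + self.discount * self.q[new_state, next_action]
        self.q[self.state, self.action] += self.step_size * (target - self.q[self.state, self.action])
        self.state, self.action = new_state, next_action
        return next_action

    def end(self, reward):
        # Terminal transition: no bootstrap term.
        self.q[self.state, self.action] += self.step_size * (reward - self.q[self.state, self.action])


def run_episode(agent, num_states, max_steps=1000):
    """Toy corridor: action 1 moves right, action 0 moves left; reward 1.0 at the right end."""
    state = 0
    action = agent.start(state)
    for _ in range(max_steps):
        state = min(state + 1, num_states - 1) if action == 1 else max(state - 1, 0)
        if state == num_states - 1:
            agent.end(1.0)
            return True
        action = agent.step(0.0, state)
    return False

agent = ToySarsaAgent(step_size=0.5, discount=0.9, num_states=4, num_actions=2)
finished = [run_episode(agent, num_states=4) for _ in range(50)]
```

The environment loop owns the transition logic; the agent only sees states and rewards, and learns inside `step` and `end`.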

end(reward)

Complete an episode.

Performs a final update to the Q-value of the last state-action pair using the terminal reward.

Parameters

reward (float): The terminal reward received at the end of the episode.
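A minimal sketch of such a final update, assuming the usual convention that the value of a terminal state is zero so the target reduces to the reward alone. The function name and the `q[state, action]` layout are hypothetical, not the library's internals.

```python
import numpy as np

def terminal_update(q, last_state, last_action, reward, step_size):
    # No next state-action pair exists, so the TD target is just the reward.
    td_error = reward - q[last_state, last_action]
    q[last_state, last_action] += step_size * td_error
    return q

q = np.zeros((3, 2))
terminal_update(q, last_state=2, last_action=1, reward=1.0, step_size=0.5)
```

Here only the entry for the last state-action pair changes, moving halfway from 0 toward the terminal reward of 1.0.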

reset()

Reset the agent’s internal state.

Initializes the Q-table to zeros at the start of training or between episodes.

select_action(q_values)

Select an action using epsilon-greedy exploration.

Parameters

q_values (numpy.ndarray): Array of Q-values for the current state.

Returns

action (int): The chosen action.
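Epsilon-greedy selection can be sketched as follows. This is an illustration under assumptions rather than the library's code: the standalone function, the explicit `rng` argument, and the random tie-breaking among equal maxima are all choices made here.

```python
import numpy as np

def epsilon_greedy(q_values, epsilon, rng):
    # With probability epsilon, explore: pick any action uniformly at random.
    if rng.random() < epsilon:
        return int(rng.integers(len(q_values)))
    # Otherwise exploit: pick a maximal action, breaking ties at random.
    best = np.flatnonzero(q_values == np.max(q_values))
    return int(rng.choice(best))

rng = np.random.default_rng(0)
q_values = np.array([0.1, 0.7, 0.3])
greedy_action = epsilon_greedy(q_values, epsilon=0.0, rng=rng)   # always the argmax
random_action = epsilon_greedy(q_values, epsilon=1.0, rng=rng)   # always exploratory
```

Random tie-breaking matters mostly early in training, when a zero-initialized table would otherwise make a plain argmax always return action 0.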

start(new_state)

Begin a new episode.

Selects the first action using the epsilon-greedy policy and stores the initial state-action pair.

Parameters

new_state (int): The initial state observed from the environment.

Returns

action (int): The first action selected by the agent.

step(reward, new_state)

Take a step in the environment.

Updates Q-values using the SARSA update rule, which incorporates the action actually taken in the next state.

Parameters

reward (float): Reward received from the previous action.
new_state (int): The new state observed from the environment.

Returns

action (int): The next action chosen by the agent.

Notes

The Q-value update follows:

\[Q(s, a) \leftarrow Q(s, a) + \alpha \Big[ r + \gamma Q(s', a') - Q(s, a) \Big]\]

where \(\alpha\) is the step size, \(\gamma\) is the discount factor, and \(a'\) is the action actually taken in the next state.
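The update rule above translates almost line for line into code. The sketch below assumes a NumPy Q-table indexed as `q[state, action]`; the function name and argument order are illustrative, not the library's.

```python
import numpy as np

def sarsa_update(q, state, action, reward, next_state, next_action, step_size, discount):
    # TD target: r + gamma * Q(s', a'), using the action actually chosen in s'.
    target = reward + discount * q[next_state, next_action]
    # Move Q(s, a) a fraction alpha of the way toward the target.
    q[state, action] += step_size * (target - q[state, action])

q = np.zeros((2, 2))
q[1, 0] = 2.0
sarsa_update(q, state=0, action=1, reward=1.0,
             next_state=1, next_action=0, step_size=0.5, discount=0.9)
```

With these numbers the target is 1.0 + 0.9 × 2.0 = 2.8, so Q(0, 1) moves from 0 to 0.5 × 2.8 = 1.4.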