SARSA Agent
- class rlforge.agents.tabular.sarsa.SarsaAgent(step_size, discount, num_states, num_actions, epsilon=0.1)
Tabular agent implementing the SARSA algorithm.
SARSA (State-Action-Reward-State-Action) is an on-policy temporal difference learning method. Unlike Q-learning, which updates toward the maximum action value in the next state, SARSA updates toward the value of the action actually taken under the current policy. Because the update target depends on the action the policy actually selects, the learned values reflect the agent's exploration strategy, including its exploratory (non-greedy) moves.
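The difference between the two targets can be illustrated with a minimal sketch. The function names and the small Q-table below are hypothetical and are not part of the rlforge API; plain nested lists are used to keep the example dependency-free.

```python
def sarsa_target(q, reward, next_state, next_action, discount):
    """TD target built from the action the policy actually chose in next_state."""
    return reward + discount * q[next_state][next_action]


def q_learning_target(q, reward, next_state, discount):
    """TD target built from the largest action value in next_state."""
    return reward + discount * max(q[next_state])


# Hypothetical table with 2 states and 2 actions, for illustration only.
q = [[0.0, 0.0],
     [1.0, 5.0]]
reward, next_state, discount = 1.0, 1, 0.9

# Suppose exploration picked the non-greedy action 0 in the next state.
on_policy = sarsa_target(q, reward, next_state, 0, discount)      # 1.0 + 0.9 * 1.0
off_policy = q_learning_target(q, reward, next_state, discount)   # 1.0 + 0.9 * 5.0
print(on_policy, off_policy)
```

When the next action happens to be the greedy one, the two targets coincide; they differ only on exploratory steps.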
Notes
The agent uses an epsilon-greedy policy for action selection.
This implementation does not include planning steps; it inherits directly from BaseAgent.
- end(reward)
Complete an episode.
Performs a final update to the Q-value of the last state-action pair using the terminal reward.
Parameters
- reward : float
The terminal reward received at the end of the episode.
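A terminal update of this kind is commonly written with no bootstrap term, since there is no next state-action pair to evaluate. The sketch below assumes that convention; the function name and the list-based Q-table are hypothetical and may differ from what `end` does internally.

```python
def terminal_update(q, state, action, reward, step_size):
    """Move Q(state, action) toward the terminal reward alone.

    Assumption: the terminal state contributes no future value, so the
    target is just `reward`.
    """
    td_error = reward - q[state][action]
    q[state][action] += step_size * td_error
    return q[state][action]


# Hypothetical 2x2 table, for illustration only.
q = [[0.0, 0.0], [0.0, 0.0]]
terminal_update(q, state=1, action=0, reward=1.0, step_size=0.5)
print(q[1][0])  # halfway from 0.0 toward the reward of 1.0
```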
- reset()
Reset the agent’s internal state.
Initializes the Q-table to zeros at the start of training or between episodes.
- select_action(q_values)
Select an action using epsilon-greedy exploration.
Parameters
- q_values : numpy.ndarray
Array of Q-values for the current state.
Returns
- action : int
The chosen action.
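Epsilon-greedy selection can be sketched as follows. This is an illustrative stand-in, not the library's code: the function name, the `rng` parameter, and the random tie-breaking among equally valued actions are assumptions of this sketch. It accepts any sequence of Q-values, including a one-dimensional NumPy array.

```python
import random


def epsilon_greedy(q_values, epsilon, rng=random):
    """Return an action index: uniform random with probability epsilon, else greedy."""
    if rng.random() < epsilon:
        # Explore: pick any action uniformly at random.
        return rng.randrange(len(q_values))
    # Exploit: pick a maximal action, breaking ties at random (assumption).
    best = max(q_values)
    ties = [a for a, value in enumerate(q_values) if value == best]
    return rng.choice(ties)


rng = random.Random(0)  # seeded generator for reproducibility
print(epsilon_greedy([0.1, 0.9, 0.3], epsilon=0.0, rng=rng))  # always the greedy action
```

With `epsilon=0.1`, the default in the class signature, roughly one selection in ten is exploratory.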
- start(new_state)
Begin a new episode.
Selects the first action using the epsilon-greedy policy and stores the initial state-action pair.
Parameters
- new_state : int
The initial state observed from the environment.
Returns
- action : int
The first action selected by the agent.
- step(reward, new_state)
Take a step in the environment.
Updates Q-values using the SARSA update rule, which incorporates the action actually taken in the next state.
Parameters
- reward : float
Reward received from the previous action.
- new_state : int
The new state observed from the environment.
Returns
- action : int
The next action chosen by the agent.
Notes
The Q-value update follows:
\[Q(s, a) \leftarrow Q(s, a) + \alpha \Big[ r + \gamma Q(s', a') - Q(s, a) \Big]\]
where \(\alpha\) is the step size, \(\gamma\) is the discount factor, and \(a'\) is the action actually taken in the next state.
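The update rule can be written out as a short sketch. The function name and the list-based Q-table are hypothetical; the arguments `step_size` and `discount` correspond to \(\alpha\) and \(\gamma\) and follow the constructor's parameter names.

```python
def sarsa_update(q, s, a, r, s_next, a_next, step_size, discount):
    """Apply one SARSA update to q[s][a] in place and return the TD error."""
    # Bootstrap from the action actually selected in the next state.
    td_error = r + discount * q[s_next][a_next] - q[s][a]
    q[s][a] += step_size * td_error
    return td_error


# Hypothetical 2x2 table, for illustration only.
q = [[0.0, 0.0],
     [2.0, 4.0]]
delta = sarsa_update(q, s=0, a=1, r=1.0, s_next=1, a_next=0,
                     step_size=0.5, discount=0.9)
# TD error: 1.0 + 0.9 * 2.0 - 0.0; q[0][1] moves half of that distance.
print(delta, q[0][1])
```

Only the entry for the previous state-action pair changes; the bootstrapped entry `q[s_next][a_next]` is read but not modified.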