Q-Learning Agent

class rlforge.agents.tabular.q_agent.QAgent(step_size, discount, num_states, num_actions, epsilon=0.1, planning=False, planning_steps=0, exploration_bonus=0)

Tabular agent implementing the Q-learning algorithm.

This agent extends PlanningAgent and defines the Q-value update rule using the off-policy Q-learning method. Unlike SARSA, which updates toward the value of the action actually taken in the next state, Q-learning updates toward the maximum estimated action value in the next state. As a result, it learns the action values of the greedy policy regardless of the behavior the agent follows while exploring.

Notes

The agent uses an epsilon-greedy policy for action selection.
Planning steps can be enabled via the base class to simulate experience from the learned model.

end(reward)

Complete an episode.

Performs a final update to the Q-value of the last state-action pair.

Parameters

reward (float): The terminal reward received at the end of the episode.
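The final update differs from a regular step because there is no next state to bootstrap from. A minimal sketch, assuming the conventional treatment in which the value of the terminal state is taken as zero, so the target is the reward alone. The function name `terminal_update` and the list-of-lists Q-table are illustrative choices, not part of the rlforge API.

```python
def terminal_update(q, last_state, last_action, reward, step_size):
    """Final update of an episode (illustrative helper, not rlforge code).

    ASSUMPTION: the terminal state has value zero, so the target
    is the reward alone, with no discounted bootstrap term.
    """
    td_error = reward - q[last_state][last_action]
    q[last_state][last_action] += step_size * td_error
    return td_error


# One state, two actions; update the second action after a terminal reward of 1.0.
q = [[0.0, 0.5]]
td_error = terminal_update(q, last_state=0, last_action=1, reward=1.0, step_size=0.5)
```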

planning_step()

Perform planning updates using the learned model.

Randomly samples stored transitions from the model and applies Q-value updates. An exploration bonus can be added to encourage revisiting state-action pairs that have not been updated recently.
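The planning loop can be sketched roughly as follows. This is not the library's implementation: the bonus form `exploration_bonus * sqrt(tau)` is borrowed from the Dyna-Q+ formulation and is an assumption here, as are the function name `planning_updates`, the nested-dict model layout, and the list-of-lists tables.

```python
import math
import random


def planning_updates(q, model, tau, planning_steps, step_size, discount,
                     exploration_bonus=0.0, rng=random):
    """Replay randomly sampled stored transitions (illustrative sketch).

    ASSUMPTION: model maps state -> {action: (next_state, reward)} and
    tau[state][action] counts steps since that pair was last updated.
    """
    for _ in range(planning_steps):
        state = rng.choice(list(model.keys()))
        action = rng.choice(list(model[state].keys()))
        next_state, reward = model[state][action]
        # ASSUMPTION: Dyna-Q+ style bonus; the source does not state the formula.
        reward += exploration_bonus * math.sqrt(tau[state][action])
        target = reward + discount * max(q[next_state])
        q[state][action] += step_size * (target - q[state][action])


# A single stored transition: from state 0, action 0 leads to state 1 with reward 1.0.
model = {0: {0: (1, 1.0)}}
q = [[0.0], [0.0]]
tau = [[4], [0]]
planning_updates(q, model, tau, planning_steps=1, step_size=0.5,
                 discount=0.9, exploration_bonus=0.5)
```

With `exploration_bonus=0`, the loop reduces to replaying stored transitions without any bonus.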

reset()

Reset the agent’s internal state.

Initializes the Q-table to zeros. If planning is enabled, also initializes the tau matrix, which tracks the time since each state-action pair was last updated.

select_action(q_values)

Select an action using epsilon-greedy exploration.

Parameters

q_values (numpy.ndarray): Array of Q-values for the current state.

Returns

action (int): The chosen action.
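For illustration, epsilon-greedy selection can be sketched in a few lines. The function name `epsilon_greedy`, the explicit `epsilon` argument, and the random tie-breaking among equally valued greedy actions are assumptions of this sketch; the source does not say how QAgent breaks ties.

```python
import random


def epsilon_greedy(q_values, epsilon, rng=random):
    """Pick a random action with probability epsilon, else a greedy one."""
    num_actions = len(q_values)
    if rng.random() < epsilon:
        # Explore: sample uniformly over all actions.
        return rng.randrange(num_actions)
    # Exploit: choose among the maximal actions.
    # ASSUMPTION: ties are broken uniformly at random.
    best = max(q_values)
    greedy_actions = [a for a, value in enumerate(q_values) if value == best]
    return rng.choice(greedy_actions)


# With epsilon set to 0 the choice is purely greedy.
greedy_action = epsilon_greedy([0.1, 0.9, 0.3], epsilon=0.0)
```

The sketch accepts any indexable sequence of numbers, including a one-dimensional numpy.ndarray.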

start(new_state)

Begin a new episode.

Parameters

new_state (int): The initial state observed from the environment.

Returns

action (int): The first action selected by the agent.

step(reward, new_state)

Take a step in the environment.

Updates Q-values based on the transition and, if planning is enabled, performs additional simulated updates using the learned model.

Parameters

reward (float): Reward received from the previous action.
new_state (int): The new state observed from the environment.

Returns

action (int): The next action chosen by the agent.
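The start, step, and end methods are meant to be called in sequence by whatever drives the episode. A minimal sketch of such a driver, assuming a hypothetical environment object whose `reset()` returns an initial state and whose `step(action)` returns a `(state, reward, done)` tuple; that environment interface and the `run_episode` helper are not part of this documentation.

```python
def run_episode(agent, env, max_steps=1000):
    """Drive one episode through the start/step/end interface.

    ASSUMPTION: env.reset() -> state and env.step(action) -> (state, reward, done).
    """
    state = env.reset()
    action = agent.start(state)
    total_reward = 0.0
    for _ in range(max_steps):
        state, reward, done = env.step(action)
        total_reward += reward
        if done:
            # Terminal transition: no next action is needed.
            agent.end(reward)
            break
        action = agent.step(reward, state)
    return total_reward
```

Any object exposing `start(new_state)`, `step(reward, new_state)`, and `end(reward)` with the signatures documented here can be passed as `agent`.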

update_model(prev_state, prev_action, reward, new_state)

Update the agent’s model of the environment.

Stores the transition (state, action → next state, reward) in the model. For unseen states, initializes all actions with default transitions.

Parameters

prev_state (int): The previous state.
prev_action (int): The action taken in the previous state.
reward (float): The reward received.
new_state (int): The new state observed.
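One way to picture the model is as a nested dictionary keyed by state and then by action. In the sketch below, the default transition for an untried action is a zero-reward self-loop, as in the Dyna-Q+ formulation; that default, the dictionary layout, and the free-function form of `update_model` are assumptions, since the source does not specify them.

```python
def update_model(model, num_actions, prev_state, prev_action, reward, new_state):
    """Record an observed transition in a dict-based model (illustrative sketch)."""
    if prev_state not in model:
        # ASSUMPTION: untried actions default to staying in place with zero reward.
        model[prev_state] = {a: (prev_state, 0.0) for a in range(num_actions)}
    model[prev_state][prev_action] = (new_state, reward)


# First visit to state 3: action 1 was taken and led to state 4 with reward 1.0.
model = {}
update_model(model, num_actions=2, prev_state=3, prev_action=1, reward=1.0, new_state=4)
```

Initializing every action on the first visit lets planning sample actions that have never been tried in a known state.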

update_q_values(prev_state, prev_action, reward, new_state)

Update Q-values using the Q-learning update rule.

The update is based on the maximum Q-value in the next state, rather than the expected value under the current policy. This makes Q-learning an off-policy method.

Parameters

prev_state (int): The previous state index.
prev_action (int): The action taken in the previous state.
reward (float): The reward received after taking the action.
new_state (int): The new state index observed after the transition.

Notes

The Q-value update follows:

\[Q(s, a) \leftarrow Q(s, a) + \alpha \Big[ r + \gamma \max_{a'} Q(s', a') - Q(s, a) \Big]\]

where \(\alpha\) is the step size, \(\gamma\) is the discount factor, and \(\max_{a'} Q(s', a')\) is the maximum action value in the next state.
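The rule translates almost directly into code. The following is a sketch rather than the library's method: it uses a plain list-of-lists Q-table and a free function named `q_learning_update`, both chosen for illustration, with `step_size` and `discount` standing in for \(\alpha\) and \(\gamma\).

```python
def q_learning_update(q, prev_state, prev_action, reward, new_state,
                      step_size, discount):
    """Apply one Q-learning update in place and return the TD error."""
    # Bootstrap from the best action value in the next state (off-policy target).
    target = reward + discount * max(q[new_state])
    td_error = target - q[prev_state][prev_action]
    q[prev_state][prev_action] += step_size * td_error
    return td_error


# Two states, two actions; state 1 already has estimates 1.0 and 2.0.
q = [[0.0, 0.0], [1.0, 2.0]]
td_error = q_learning_update(q, prev_state=0, prev_action=1, reward=1.0,
                             new_state=1, step_size=0.5, discount=0.9)
# Target is 1.0 + 0.9 * 2.0, so q[0][1] moves halfway from 0.0 toward about 2.8.
```

Note that the target uses the maximum over next-state values regardless of which action the agent goes on to select.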