Planning Agent

class rlforge.agents.tabular.planning_agent.PlanningAgent(step_size, discount, num_states, num_actions, epsilon=0.1, planning=False, planning_steps=0, exploration_bonus=0)

Abstract tabular agent that integrates planning steps to accelerate convergence.

This agent maintains a Q-table and, optionally, a model of the environment for planning-based updates. The planning mechanism allows the agent to simulate experience from its learned model, improving sample efficiency compared to purely online learning.

Notes

This is a base class meant to be subclassed. The method update_q_values() is intentionally left empty and must be implemented by derived classes (e.g., SARSA, Q-learning, Expected SARSA).
Planning is optional and controlled via the planning flag and planning_steps parameter.

end(reward)

Complete an episode.

Performs a final update to the Q-value of the last state-action pair.

Parameters

reward (float): The terminal reward received at the end of the episode.

planning_step()

Perform planning updates using the learned model.

Randomly samples stored transitions from the model and applies Q-value updates. An exploration bonus can be added to encourage revisiting less frequently updated state-action pairs.
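As a rough illustration of the planning loop described above, the following standalone sketch samples stored transitions and applies simulated updates. It does not reproduce the library's code: the function signature, the nested-dict model layout, the Q-learning target, and the bonus form exploration_bonus * sqrt(tau) (as in Dyna-Q+) are all assumptions.

```python
import numpy as np

def planning_step(q, model, tau, rng, step_size=0.1, discount=0.95,
                  planning_steps=5, exploration_bonus=0.0):
    """Apply `planning_steps` simulated updates to `q` in place.

    Assumed layout: model[state][action] = (next_state, reward).
    """
    for _ in range(planning_steps):
        # Sample a previously visited state, then an action stored for it.
        states = list(model.keys())
        s = states[rng.integers(len(states))]
        actions = list(model[s].keys())
        a = actions[rng.integers(len(actions))]
        next_s, r = model[s][a]
        # Assumed bonus: grows with the time since (s, a) was last updated.
        r = r + exploration_bonus * np.sqrt(tau[s, a])
        # Assumed update rule: Q-learning target on the simulated transition.
        target = r + discount * np.max(q[next_s])
        q[s, a] += step_size * (target - q[s, a])
    return q
```

With exploration_bonus=0 the loop reduces to plain simulated replay; a positive value raises the simulated reward of pairs that have not been updated recently.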

reset()

Reset the agent’s internal state.

Initializes the Q-table to zeros. If planning is enabled, also initializes the tau matrix, which tracks the time since each state-action pair was last updated.
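The tables involved can be pictured with the short sketch below. The helper names reset_tables and tick are hypothetical, and the way tau advances (every pair ages by one step, the updated pair returns to zero) is an assumption about how "time since last update" might be tracked.

```python
import numpy as np

def reset_tables(num_states, num_actions, planning=False):
    """Return a zeroed Q-table and, if planning is on, a zeroed tau matrix."""
    q = np.zeros((num_states, num_actions))
    tau = np.zeros((num_states, num_actions)) if planning else None
    return q, tau

def tick(tau, state, action):
    """Age every pair by one step, then mark (state, action) as just updated."""
    tau += 1
    tau[state, action] = 0
    return tau
```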

select_action(q_values)

Select an action using epsilon-greedy exploration.

Parameters

q_values (numpy.ndarray): Array of Q-values for the current state.

Returns

action (int): The chosen action.
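A minimal sketch of epsilon-greedy selection follows. It is written as a free function that takes epsilon and a NumPy random generator explicitly, which differs from the method signature above; the random tie-breaking among maximal actions is also an assumption.

```python
import numpy as np

def select_action(q_values, epsilon, rng):
    """Pick a random action with probability epsilon, else a greedy one."""
    if rng.random() < epsilon:
        # Explore: uniform over all actions.
        return int(rng.integers(len(q_values)))
    # Exploit: choose among the maximal actions, breaking ties at random.
    best = np.flatnonzero(q_values == np.max(q_values))
    return int(rng.choice(best))
```

Breaking ties at random matters early in training, when the Q-table is all zeros and a plain argmax would always return action 0.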

start(new_state)

Begin a new episode.

Parameters

new_state (int): The initial state observed from the environment.

Returns

action (int): The first action selected by the agent.

step(reward, new_state)

Take a step in the environment.

Updates Q-values based on the transition and, if planning is enabled, performs additional simulated updates using the learned model.

Parameters

reward (float): Reward received from the previous action.
new_state (int): The new state observed from the environment.

Returns

action (int): The next action chosen by the agent.
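The start(), step(), and end() methods are meant to be driven by an episode loop. The sketch below shows one way such a loop could look. RandomAgent is a hypothetical stand-in that only mimics the interface, and the small chain environment is invented for illustration; neither is part of the library.

```python
import random

class RandomAgent:
    """Hypothetical stand-in exposing the start/step/end interface."""

    def __init__(self, num_actions, seed=0):
        self.num_actions = num_actions
        self.rng = random.Random(seed)
        self.updates = 0  # counts calls that would trigger a Q-value update

    def start(self, new_state):
        return self.rng.randrange(self.num_actions)

    def step(self, reward, new_state):
        self.updates += 1
        return self.rng.randrange(self.num_actions)

    def end(self, reward):
        self.updates += 1

def run_episode(agent, num_states=5, max_steps=100):
    """Chain task: action 1 moves right, any other action moves left.

    Reaching the last state ends the episode with reward 1.
    """
    state = 0
    action = agent.start(state)
    for _ in range(max_steps):
        state = max(0, state + (1 if action == 1 else -1))
        if state == num_states - 1:
            agent.end(1.0)  # terminal transition: no next state is passed
            return 1.0
        action = agent.step(0.0, state)
    return 0.0  # step budget exhausted before reaching the goal
```

Note that end() receives only a reward, since there is no successor state to bootstrap from at the end of an episode.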

update_model(prev_state, prev_action, reward, new_state)

Update the agent’s model of the environment.

Stores the transition (state, action → next state, reward) in the model. For unseen states, initializes all actions with default transitions.

Parameters

prev_state (int): The previous state.
prev_action (int): The action taken in the previous state.
reward (float): The reward received.
new_state (int): The new state observed.
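One plausible shape for such a model is a nested dictionary, sketched below as a free function. The layout model[state][action] = (next_state, reward) and the default transition for untried actions (stay in the same state with zero reward, as in Dyna-Q+) are assumptions; the library may store transitions differently.

```python
def update_model(model, num_actions, prev_state, prev_action, reward, new_state):
    """Record an observed transition in a nested-dict model."""
    if prev_state not in model:
        # Assumed default for a newly seen state: every action loops back
        # to the same state with zero reward until it is actually tried.
        model[prev_state] = {a: (prev_state, 0.0) for a in range(num_actions)}
    # Overwrite the entry with the transition that was really observed.
    model[prev_state][prev_action] = (new_state, reward)
    return model
```

Seeding every action for a new state lets planning sample actions that have never been taken, which is what gives an exploration bonus something to act on.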

update_q_values(prev_state, prev_action, reward, new_state)

Update Q-values based on the observed transition.

This method is intentionally left empty and must be implemented by subclasses to define the specific update rule (e.g., SARSA, Q-learning, Expected SARSA).

Parameters

prev_state (int): The previous state.
prev_action (int): The action taken in the previous state.
reward (float): The reward received.
new_state (int): The new state observed.
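To give a sense of what a subclass might supply, here is a sketch of the standard Q-learning rule written as a free function. The explicit q, step_size, discount, and terminal arguments are assumptions made to keep the example self-contained; a real subclass would presumably read these from the agent instance.

```python
import numpy as np

def q_learning_update(q, prev_state, prev_action, reward, new_state,
                      step_size, discount, terminal=False):
    """Apply one Q-learning update to `q` in place."""
    # Bootstrap from the greedy value of the next state, unless terminal.
    bootstrap = 0.0 if terminal else discount * np.max(q[new_state])
    td_error = reward + bootstrap - q[prev_state, prev_action]
    q[prev_state, prev_action] += step_size * td_error
    return q
```

SARSA would replace the max with the value of the action actually selected in new_state, and Expected SARSA with the expectation of the next-state values under the current policy.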