Planning Agent
- class rlforge.agents.tabular.planning_agent.PlanningAgent(step_size, discount, num_states, num_actions, epsilon=0.1, planning=False, planning_steps=0, exploration_bonus=0)
Abstract tabular agent that integrates planning steps to accelerate convergence.
This agent maintains a Q-table and, optionally, a model of the environment for planning-based updates. The planning mechanism allows the agent to simulate experience from its learned model, improving sample efficiency compared to purely online learning.
Notes
This is a base class meant to be subclassed. The method update_q_values() is intentionally left empty and must be implemented by derived classes (e.g., SARSA, Q-learning, Expected SARSA).
Planning is optional and controlled via the planning flag and the planning_steps parameter.
- end(reward)
Complete an episode.
Performs a final update to the Q-value of the last state-action pair.
Parameters
- reward : float
The terminal reward received at the end of the episode.
- planning_step()
Perform planning updates using the learned model.
Randomly samples stored transitions from the model and applies Q-value updates. An exploration bonus can be added to encourage revisiting less frequently updated state-action pairs.
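The planning loop described above can be sketched as follows. This is a minimal illustration, not the library's implementation: it assumes a dict-of-dicts model mapping state and action to (next_state, reward), uses a Q-learning target as a stand-in for whatever update_q_values() a subclass provides, and assumes the Dyna-Q+ form of the bonus, exploration_bonus * sqrt(tau). Terminal-state handling is omitted, and plain lists stand in for the arrays.

```python
import math
import random

def planning_step(q, model, tau, step_size, discount, planning_steps,
                  exploration_bonus=0.0, rng=random):
    """Apply Q-learning-style updates to transitions sampled from the model."""
    for _ in range(planning_steps):
        # Sample a previously seen state, then one of its modelled actions.
        state = rng.choice(list(model))
        action = rng.choice(list(model[state]))
        next_state, reward = model[state][action]
        # Assumed bonus form (Dyna-Q+): grows with the time since the pair
        # was last updated, so stale pairs look more attractive.
        reward += exploration_bonus * math.sqrt(tau[state][action])
        # Stand-in update rule: one-step Q-learning target.
        target = reward + discount * max(q[next_state])
        q[state][action] += step_size * (target - q[state][action])
```

With exploration_bonus=0 this reduces to ordinary Dyna-Q planning; a positive value biases the simulated updates toward pairs the agent has not tried recently.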
- reset()
Reset the agent’s internal state.
Initializes the Q-table to zeros. If planning is enabled, also initializes the tau matrix, which tracks the time since each state-action pair was last updated.
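One way to picture the two tables is sketched below. The helper names make_tables and tick are hypothetical, nested lists stand in for the arrays, and the point at which tau is advanced (once per real step here) is an assumption rather than something the description above specifies.

```python
def make_tables(num_states, num_actions):
    """Return a zero-initialised Q-table and tau table of the same shape."""
    q = [[0.0] * num_actions for _ in range(num_states)]
    tau = [[0] * num_actions for _ in range(num_states)]
    return q, tau

def tick(tau, state, action):
    """Age every pair by one time step, then reset the pair just visited."""
    for row in tau:
        for a in range(len(row)):
            row[a] += 1
    tau[state][action] = 0
```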
- select_action(q_values)
Select an action using epsilon-greedy exploration.
Parameters
- q_values : numpy.ndarray
Array of Q-values for the current state.
Returns
- action : int
The chosen action.
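For reference, epsilon-greedy selection can be sketched as below. This is an illustrative stand-alone function, not the method itself: the random tie-breaking among equally good actions and the rng argument are assumptions, and any sequence of numbers (including a 1-D array) can be passed as q_values.

```python
import random

def epsilon_greedy(q_values, epsilon, rng=random):
    """Pick an action index epsilon-greedily from a sequence of Q-values."""
    n = len(q_values)
    if rng.random() < epsilon:
        # Explore: choose uniformly among all actions.
        return rng.randrange(n)
    # Exploit: choose among the greedy actions, breaking ties at random.
    best = max(q_values)
    greedy = [a for a in range(n) if q_values[a] == best]
    return rng.choice(greedy)
```

Passing a seeded random.Random instance as rng makes runs reproducible.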
- start(new_state)
Begin a new episode.
Parameters
- new_state : int
The initial state observed from the environment.
Returns
- action : int
The first action selected by the agent.
- step(reward, new_state)
Process one environment transition and choose the next action.
Updates Q-values based on the transition and, if planning is enabled, performs additional simulated updates using the learned model.
Parameters
- reward : float
Reward received from the previous action.
- new_state : int
The new state observed from the environment.
Returns
- action : int
The next action chosen by the agent.
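The start(), step(), and end() methods are meant to be called in sequence over an episode. The driver below is a sketch of that calling pattern. The environment interface is hypothetical: env.reset() is assumed to return the initial state and env.step(action) is assumed to return (reward, new_state, done); adapt both to the environment actually in use.

```python
def run_episode(agent, env, max_steps=1000):
    """Drive one episode through start/step/end and return the total reward."""
    state = env.reset()                 # hypothetical: returns the initial state
    action = agent.start(state)
    total_reward = 0.0
    for _ in range(max_steps):
        # Hypothetical return signature: (reward, new_state, done).
        reward, state, done = env.step(action)
        total_reward += reward
        if done:
            # Terminal transition: final update only, no next action.
            agent.end(reward)
            break
        action = agent.step(reward, state)
    return total_reward
```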
- update_model(prev_state, prev_action, reward, new_state)
Update the agent’s model of the environment.
Stores the transition (state, action → next state, reward) in the model. For unseen states, initializes all actions with default transitions.
Parameters
- prev_state : int
The previous state.
- prev_action : int
The action taken in the previous state.
- reward : float
The reward received.
- new_state : int
The new state observed.
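A deterministic tabular model of this kind can be sketched as a dictionary of dictionaries. The layout and the default transition are assumptions: here an unseen state has every action initialised to loop back to the same state with zero reward, which is the usual Dyna-Q+ convention and may differ from what the class actually stores.

```python
def update_model(model, num_actions, prev_state, prev_action, reward, new_state):
    """Record a deterministic transition in a dict-of-dicts model."""
    if prev_state not in model:
        # Assumed default: every action returns to the same state with reward 0.
        model[prev_state] = {a: (prev_state, 0.0) for a in range(num_actions)}
    # Overwrite the entry for the action that was actually taken.
    model[prev_state][prev_action] = (new_state, reward)

model = {}
update_model(model, num_actions=2, prev_state=3, prev_action=1,
             reward=1.0, new_state=4)
```

Giving untried actions a default entry means planning can sample them too, which matters when an exploration bonus is used.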
- update_q_values(prev_state, prev_action, reward, new_state)
Update Q-values based on the observed transition.
This method is intentionally left empty and must be implemented by subclasses to define the specific update rule (e.g., SARSA, Q-learning, Expected SARSA).
Parameters
- prev_state : int
The previous state.
- prev_action : int
The action taken in the previous state.
- reward : float
The reward received.
- new_state : int
The new state observed.
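As an illustration of the kind of rule a subclass might supply, the sketch below writes the Expected SARSA update as a stand-alone function. It is not taken from the library: the function takes the Q-table and hyperparameters as explicit arguments instead of reading them from the agent, uses nested lists in place of arrays, and assumes the behaviour policy is epsilon-greedy with the greedy probability shared among tied actions.

```python
def expected_sarsa_update(q, step_size, discount, epsilon,
                          prev_state, prev_action, reward, new_state):
    """Apply one Expected SARSA update to q[prev_state][prev_action] in place."""
    next_q = q[new_state]
    n = len(next_q)
    best = max(next_q)
    num_greedy = sum(1 for v in next_q if v == best)
    # Expected next value under epsilon-greedy: epsilon is spread uniformly,
    # the remaining mass is shared by the greedy actions.
    expected = 0.0
    for v in next_q:
        prob = epsilon / n
        if v == best:
            prob += (1.0 - epsilon) / num_greedy
        expected += prob * v
    target = reward + discount * expected
    q[prev_state][prev_action] += step_size * (target - q[prev_state][prev_action])
```

Setting epsilon to 0 makes the expectation equal to the maximum next value, so the same function then performs a Q-learning update.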