Tile Coding Linear Semi-gradient Q-Learning
- class rlforge.agents.semi_gradient.linear_sg_agent.LinearQAgent(step_size, discount, num_actions, dims_ranges, epsilon=0.1, iht_size=4096, num_tilings=8, num_tiles=8, wrap_dims=())
Linear Q-Learning Agent with Tile Coding function approximation.
This agent implements Q-learning using a linear function approximator over tile-coded features. It supports epsilon-greedy exploration and incremental weight updates based on temporal-difference (TD) errors.
Parameters
- step_size : float
Learning rate for weight updates.
- discount : float
Discount factor (γ) applied to future rewards.
- num_actions : int
Number of discrete actions available in the environment.
- dims_ranges : list of tuple
Ranges for each state dimension, used by the tile coder.
- epsilon : float, optional
Exploration rate for the epsilon-greedy policy, i.e. the probability of taking a random action (default=0.1).
- iht_size : int, optional
Size of the index hash table for tile coding (default=4096).
- num_tilings : int, optional
Number of tilings used in tile coding (default=8).
- num_tiles : int, optional
Number of tiles per dimension (default=8).
- wrap_dims : tuple, optional
Dimensions to wrap in tile coding (default=()).
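To illustrate how num_tilings, num_tiles, and dims_ranges interact, here is a simplified grid tile coder. It is a sketch rather than rlforge's implementation: it skips the index hash table (iht_size) and wrapping (wrap_dims), shifts every tiling by the same fraction of a tile in each dimension, and assumes each dims_ranges entry is a (low, high) pair. The function name active_tiles is hypothetical.

```python
import math


def active_tiles(state, dims_ranges, num_tilings=8, num_tiles=8):
    """Return one active tile index per tiling for a continuous state.

    Assumption: dims_ranges holds one (low, high) pair per state dimension.
    """
    cells = num_tiles + 1  # one extra cell per dimension to absorb the offset
    indices = []
    for tiling in range(num_tilings):
        offset = tiling / num_tilings  # shift, as a fraction of one tile width
        index = 0
        for value, (low, high) in zip(state, dims_ranges):
            scaled = (value - low) / (high - low) * num_tiles
            coord = int(math.floor(scaled + offset))
            coord = min(max(coord, 0), num_tiles)  # clamp out-of-range states
            index = index * cells + coord
        # Give each tiling its own block of indices so tilings never collide.
        indices.append(tiling * cells ** len(state) + index)
    return indices


ranges = [(0.0, 1.0), (0.0, 1.0)]
tiles_a = active_tiles((0.0, 0.0), ranges, num_tilings=2, num_tiles=4)
tiles_b = active_tiles((0.2, 0.0), ranges, num_tilings=2, num_tiles=4)
shared = set(tiles_a) & set(tiles_b)  # nearby states share some tiles
```

Each state activates exactly num_tilings features. Nearby states overlap in some of them, which is what lets the linear model generalize between similar states.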
- end(reward)
Complete an episode.
Performs a final update of the linear model weights using the terminal reward and the last cached state-action pair.
Parameters
- reward : float
Final reward received at the end of the episode.
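The terminal update can be sketched as follows. This assumes the usual Q-learning convention that a terminal transition has no next state to bootstrap from, so the target is the reward alone. The per-action weight layout and the function name terminal_update are hypothetical, and the step size is applied as given, without any scaling by the number of tilings.

```python
def terminal_update(weights, prev_tiles, prev_action, reward, step_size):
    """Move Q(s, a) toward the terminal reward and return the TD error."""
    q_prev = sum(weights[prev_action][t] for t in prev_tiles)
    delta = reward - q_prev  # no discounted next-state term at termination
    for t in prev_tiles:
        # With binary tile features, the gradient for each active weight is 1.
        weights[prev_action][t] += step_size * delta
    return delta


# Assumed layout: one weight vector per action, indexed by tile.
weights = [[0.0] * 4 for _ in range(2)]
delta = terminal_update(weights, prev_tiles=[0, 2], prev_action=1,
                        reward=1.0, step_size=0.25)
```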
- get_td_error(prev_tiles, prev_action, reward, active_tiles)
Compute the temporal-difference (TD) error.
Calculates the TD error as δ = reward + γ · max_a Q(s′, a) − Q(s, prev_action): the reward plus the discounted maximum Q-value of the next state, minus the Q-value of the previous state-action pair.
Parameters
- prev_tiles : array-like
Active tiles for the previous state.
- prev_action : int
Action taken in the previous state.
- reward : float
Reward received for the transition.
- active_tiles : array-like
Active tiles for the current state, i.e. the state reached by the transition.
Returns
- float
The computed TD error.
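As a worked example of this computation, the sketch below evaluates the Q-learning TD error for a linear model over binary tile features, where Q(s, a) is the sum of the weights of the active tiles for action a. The standalone functions and the per-action weight layout are assumptions for illustration. In the agent itself, the weights and discount are presumably attributes rather than arguments.

```python
def q_value(weights, tiles, action):
    """Q(s, a): sum of the weights of the active tiles for that action."""
    return sum(weights[action][t] for t in tiles)


def td_error(weights, prev_tiles, prev_action, reward, active_tiles, discount):
    """delta = reward + discount * max_a Q(s', a) - Q(s, prev_action)."""
    num_actions = len(weights)
    next_max = max(q_value(weights, active_tiles, a) for a in range(num_actions))
    return reward + discount * next_max - q_value(weights, prev_tiles, prev_action)


# Two actions, four weights per action (hypothetical values).
weights = [[0.5, 0.0, 0.0, 0.0],
           [0.0, 1.0, 2.0, 0.0]]
delta = td_error(weights, prev_tiles=[0, 1], prev_action=0,
                 reward=1.0, active_tiles=[1, 2], discount=0.5)
# Q(s, a=0) = 0.5; max_a Q(s', a) = 3.0; delta = 1.0 + 0.5 * 3.0 - 0.5
```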
- reset()
Reset the agent.
Resets the weights of the linear model to their initial values.
- select_action(q_values)
Select an action using an epsilon-greedy policy.
Parameters
- q_values : array-like
Estimated Q-values for all actions in the current state.
Returns
- int
The action selected by epsilon-greedy exploration.
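A minimal sketch of epsilon-greedy selection is shown below: with probability epsilon a uniformly random action is taken, otherwise a greedy one. The function name, the rng argument, and the random tie-breaking among equally good actions are assumptions. The agent may resolve ties differently.

```python
import random


def epsilon_greedy(q_values, epsilon=0.1, rng=random):
    """Pick a random action with probability epsilon, else a greedy one."""
    if rng.random() < epsilon:
        return rng.randrange(len(q_values))  # explore
    best = max(q_values)
    best_actions = [a for a, q in enumerate(q_values) if q == best]
    return rng.choice(best_actions)  # exploit, breaking ties at random


greedy_action = epsilon_greedy([0.1, 0.9, 0.3], epsilon=0.0)
random_action = epsilon_greedy([0.1, 0.9, 0.3], epsilon=1.0,
                               rng=random.Random(0))
```

Setting epsilon to 0 gives a purely greedy policy, which is useful when evaluating a trained agent.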
- start(new_state)
Begin a new episode.
Extracts active tiles for the initial state, computes Q-values, selects an action using epsilon-greedy, and caches the state-action pair for future updates.
Parameters
- new_state : array-like
The initial state observed from the environment.
Returns
- int
The action selected by the agent.
- step(reward, new_state)
Take a step in the environment.
Updates the linear model weights using the TD error from the previous transition, then selects the next action based on the new state.
Parameters
- reward : float
Reward received from the previous action.
- new_state : array-like
The new state observed from the environment.
Returns
- int
The next action chosen by the agent.
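The start, step, and end methods are meant to be called in sequence over an episode. The loop below sketches that calling pattern. RandomAgent and CorridorEnv are hypothetical stand-ins that keep the example self-contained, and the environment interface (reset, and step returning state, reward, done) is an assumption rather than part of this class.

```python
import random


class RandomAgent:
    """Hypothetical stand-in exposing the same start/step/end methods."""

    def __init__(self, num_actions, seed=0):
        self.num_actions = num_actions
        self.rng = random.Random(seed)
        self.updates = 0  # counts where a real agent would update its weights

    def start(self, new_state):
        return self.rng.randrange(self.num_actions)

    def step(self, reward, new_state):
        self.updates += 1
        return self.rng.randrange(self.num_actions)

    def end(self, reward):
        self.updates += 1


class CorridorEnv:
    """Hypothetical toy environment: walk from position 0 to position 3."""

    def reset(self):
        self.pos = 0
        return [float(self.pos)]

    def step(self, action):
        self.pos = max(0, self.pos + (1 if action == 1 else -1))
        done = self.pos >= 3
        return [float(self.pos)], (0.0 if done else -1.0), done


def run_episode(agent, env, max_steps=1000):
    """Drive one episode: start once, step per transition, end on termination."""
    action = agent.start(env.reset())
    for t in range(1, max_steps + 1):
        state, reward, done = env.step(action)
        if done:
            agent.end(reward)  # terminal update, no next action needed
            return t
        action = agent.step(reward, state)  # update, then choose next action
    return max_steps


agent = RandomAgent(num_actions=2)
steps = run_episode(agent, CorridorEnv())
```

Every environment transition triggers exactly one agent call, either step or end, so the agent sees one update opportunity per transition.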