Tile Coding Linear Semi-gradient Q-Learning

class rlforge.agents.semi_gradient.linear_sg_agent.LinearQAgent(step_size, discount, num_actions, dims_ranges, epsilon=0.1, iht_size=4096, num_tilings=8, num_tiles=8, wrap_dims=())

Linear Q-Learning Agent with Tile Coding function approximation.

This agent implements Q-learning using a linear function approximator over tile-coded features. It supports epsilon-greedy exploration and incremental weight updates based on temporal-difference (TD) errors.
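As a rough illustration of what a linear approximator over tile-coded features computes, the sketch below sums the weights of the active tiles for each action. The weight layout `(num_actions, iht_size)` and the helper name `q_values` are assumptions made for this example; the agent's internal storage is not documented here.

```python
import numpy as np

def q_values(weights, active_tiles):
    """Return one Q-value per action by summing the weights of the active tiles.

    weights: array of shape (num_actions, iht_size); this layout is assumed.
    active_tiles: integer indices of the active tiles, one per tiling.
    """
    # Tile-coded features are binary, so the dot product w . x reduces to
    # a sum over the weights at the active indices.
    return weights[:, active_tiles].sum(axis=1)

weights = np.zeros((3, 16))
weights[1, [2, 5]] = 0.5          # give action 1 some value on tiles 2 and 5
q = q_values(weights, np.array([2, 5, 9]))
```

Because only `num_tilings` features are active at a time, each evaluation touches a handful of weights regardless of `iht_size`.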

Parameters

step_size (float): Learning rate for weight updates.
discount (float): Discount factor (γ) applied to future rewards.
num_actions (int): Number of discrete actions available in the environment.
dims_ranges (list of tuple): Ranges for each state dimension, used by the tile coder.
epsilon (float, optional): Exploration rate for epsilon-greedy policy (default=0.1).
iht_size (int, optional): Size of the index hash table for tile coding (default=4096).
num_tilings (int, optional): Number of tilings used in tile coding (default=8).
num_tiles (int, optional): Number of tiles per dimension (default=8).
wrap_dims (tuple, optional): Dimensions to wrap in tile coding (default=()).
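To make the roles of `dims_ranges`, `num_tilings` and `num_tiles` more concrete, here is a deliberately simplified grid tile coder. It is a stand-in, not the library's tile coder: it does no hashing (so `iht_size` plays no part), ignores `wrap_dims`, and assumes each entry of `dims_ranges` is a `(low, high)` pair. The function name `active_tiles` is hypothetical.

```python
import numpy as np

def active_tiles(state, dims_ranges, num_tilings=8, num_tiles=8):
    """Return one active tile index per tiling for a continuous state."""
    state = np.asarray(state, dtype=float)
    lows = np.array([lo for lo, _ in dims_ranges], dtype=float)
    highs = np.array([hi for _, hi in dims_ranges], dtype=float)
    # Rescale each dimension so that its range spans num_tiles tile widths.
    scaled = (state - lows) / (highs - lows) * num_tiles
    side = num_tiles + 1                      # extra tile to cover the offsets
    shape = (side,) * len(dims_ranges)
    tiles_per_tiling = side ** len(dims_ranges)
    indices = []
    for t in range(num_tilings):
        # Each tiling is shifted by a different fraction of one tile width.
        coords = np.floor(scaled + t / num_tilings).astype(int)
        coords = np.clip(coords, 0, side - 1)
        flat = int(np.ravel_multi_index(tuple(coords), shape))
        # Give every tiling its own block of indices.
        indices.append(t * tiles_per_tiling + flat)
    return np.array(indices)

tiles = active_tiles([0.3, -0.2], dims_ranges=[(0.0, 1.0), (-1.0, 1.0)])
```

Exactly `num_tilings` indices come back for any state, which is why the per-step cost of the linear model does not grow with the resolution of the tiling.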

end(reward)

Complete an episode.

Performs a final update of the linear model weights using the terminal reward and the last cached state-action pair.

Parameters

reward (float): Final reward received at the end of the episode.
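A terminal update of this kind can be sketched as follows. Since there is no next state, the target is the reward alone. The weight layout, the helper name `terminal_update`, and the use of `step_size` without dividing by the number of tilings are assumptions for illustration.

```python
import numpy as np

def terminal_update(weights, prev_tiles, prev_action, reward, step_size):
    """Apply the final semi-gradient update of an episode, in place."""
    q_prev = weights[prev_action, prev_tiles].sum()
    # No bootstrap term: the value of a terminal state is zero.
    td_error = reward - q_prev
    # The gradient of a linear model w.r.t. its weights is the binary
    # feature vector, so only the active tiles are adjusted.
    weights[prev_action, prev_tiles] += step_size * td_error
    return td_error

weights = np.zeros((2, 16))
delta = terminal_update(weights, np.array([1, 4, 7]), prev_action=0,
                        reward=1.0, step_size=0.1)
```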

get_td_error(prev_tiles, prev_action, reward, active_tiles)

Compute the temporal-difference (TD) error.

Calculates the TD error using the reward, the discounted maximum Q-value of the next state, and the Q-value of the previous state-action.

Parameters

prev_tiles (array-like): Active tiles for the previous state.
prev_action (int): Action taken in the previous state.
reward (float): Reward received for the transition.
active_tiles (array-like): Active tiles for the current state.

Returns

float: The computed TD error.
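The calculation described above corresponds to the standard Q-learning error, δ = r + γ · max over a' of Q(s', a') − Q(s, a). A minimal standalone sketch, assuming weights stored as a `(num_actions, iht_size)` array and passing the discount explicitly (the method itself presumably reads both from the agent):

```python
import numpy as np

def td_error(weights, prev_tiles, prev_action, reward, active_tiles, discount):
    """Q-learning TD error for a linear model over binary tile features."""
    # Q(s, a): sum of the taken action's weights at the previous active tiles.
    q_prev = weights[prev_action, prev_tiles].sum()
    # max_a' Q(s', a'): best value over all actions in the current state.
    q_next_max = weights[:, active_tiles].sum(axis=1).max()
    return reward + discount * q_next_max - q_prev

weights = np.zeros((2, 16))
weights[1, [3, 8]] = 0.5          # action 1 is worth 1.0 in the current state
delta = td_error(weights, np.array([0, 5]), 0, reward=1.0,
                 active_tiles=np.array([3, 8]), discount=0.9)
```

Taking the maximum over next-state actions, rather than the value of the action actually chosen, is what makes the update off-policy.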

reset()

Reset the agent.

Resets the weights of the linear model to their initial values.

select_action(q_values)

Select an action using an epsilon-greedy policy.

Parameters

q_values (array-like): Estimated Q-values for all actions in the current state.

Returns

int: The action selected by epsilon-greedy exploration.
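For reference, epsilon-greedy selection can be sketched as below. Passing `epsilon` and a random generator as arguments, and breaking ties among equal Q-values at random, are choices made for this example and may differ from the agent's behavior.

```python
import numpy as np

def select_action(q_values, epsilon, rng):
    """Pick a random action with probability epsilon, else a greedy one."""
    q_values = np.asarray(q_values)
    if rng.random() < epsilon:
        # Explore: uniform over all actions.
        return int(rng.integers(len(q_values)))
    # Exploit: choose among the maximizing actions, breaking ties at random.
    best = np.flatnonzero(q_values == q_values.max())
    return int(rng.choice(best))

rng = np.random.default_rng(0)
greedy = select_action([0.1, 0.7, 0.3], epsilon=0.0, rng=rng)
explore = select_action([0.1, 0.7, 0.3], epsilon=1.0, rng=rng)
```

With `epsilon=0.0` the call always returns the index of the largest Q-value; with `epsilon=1.0` it ignores the Q-values entirely.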

start(new_state)

Begin a new episode.

Extracts the active tiles for the initial state, computes Q-values, selects an action with the epsilon-greedy policy, and caches the state-action pair for future updates.

Parameters

new_state (array-like): The initial state observed from the environment.

Returns

int: The action selected by the agent.

step(reward, new_state)

Take a step in the environment.

Updates the linear model weights using the TD error from the previous transition, then selects the next action based on the new state.

Parameters

reward (float): Reward received from the previous action.
new_state (array-like): The new state observed from the environment.

Returns

int: The next action chosen by the agent.
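The `start`, `step` and `end` methods are meant to be called in that order over an episode. The loop below sketches the calling sequence using a hypothetical `RandomAgent` stand-in with the same three methods and a toy `corridor` transition function; neither is part of the library, and the `(state, reward, done)` return convention is an assumption.

```python
import random

class RandomAgent:
    """Hypothetical stand-in exposing the same start/step/end methods."""

    def __init__(self, num_actions, seed=0):
        self.num_actions = num_actions
        self.rng = random.Random(seed)
        self.total_reward = 0.0

    def start(self, new_state):
        self.total_reward = 0.0
        return self.rng.randrange(self.num_actions)

    def step(self, reward, new_state):
        self.total_reward += reward
        return self.rng.randrange(self.num_actions)

    def end(self, reward):
        self.total_reward += reward

def corridor(state, action):
    """Toy 1-D corridor: action 1 moves right, action 0 left; goal at 3."""
    position = max(0, state[0] + (1 if action == 1 else -1))
    done = position >= 3
    return [position], (0.0 if done else -1.0), done

def run_episode(agent, transition, initial_state, max_steps=100):
    """Drive one episode and return the number of environment steps taken."""
    action = agent.start(initial_state)          # first action, no reward yet
    state = initial_state
    for t in range(max_steps):
        state, reward, done = transition(state, action)
        if done:
            agent.end(reward)                    # terminal update, no action
            return t + 1
        action = agent.step(reward, state)       # update, then next action
    return max_steps

agent = RandomAgent(num_actions=2)
steps = run_episode(agent, corridor, initial_state=[0])
```

A `LinearQAgent` would slot into `run_episode` in place of `RandomAgent`, provided the environment's states fall inside the configured `dims_ranges`.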