Tile Coding Linear Semi-gradient Q-Learning

class rlforge.agents.semi_gradient.linear_sg_agent.LinearQAgent(step_size, discount, num_actions, dims_ranges, epsilon=0.1, iht_size=4096, num_tilings=8, num_tiles=8, wrap_dims=())

Linear Q-Learning Agent with Tile Coding function approximation.

This agent implements Q-learning using a linear function approximator over tile-coded features. It supports epsilon-greedy exploration and incremental weight updates based on temporal-difference (TD) errors.

Parameters

step_size : float

Learning rate for weight updates.

discount : float

Discount factor (γ) applied to future rewards.

num_actions : int

Number of discrete actions available in the environment.

dims_ranges : list of tuple

Ranges for each state dimension, used by the tile coder.

epsilon : float, optional

Exploration rate for epsilon-greedy policy (default=0.1).

iht_size : int, optional

Size of the index hash table for tile coding (default=4096).

num_tilings : int, optional

Number of tilings used in tile coding (default=8).

num_tiles : int, optional

Number of tiles per dimension (default=8).

wrap_dims : tuple, optional

Dimensions to wrap in tile coding (default=()).
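The sketch below illustrates how `dims_ranges`, `num_tilings`, and `num_tiles` might interact, assuming a plain grid tile coder with uniformly offset tilings. It is not the library's implementation: it omits the index hash table (`iht_size`) and wrapping (`wrap_dims`), and the function name `active_tiles` is hypothetical.

```python
import math


def active_tiles(state, dims_ranges, num_tilings=8, num_tiles=8):
    """Return one active tile index per tiling for a continuous state.

    Illustrative only: no hashing and no wrapped dimensions.
    """
    ndims = len(dims_ranges)
    # One extra tile per dimension leaves room for the shifted tilings.
    tiles_per_tiling = (num_tiles + 1) ** ndims
    indices = []
    for tiling in range(num_tilings):
        offset = tiling / num_tilings  # shift, as a fraction of one tile width
        index = 0
        for x, (low, high) in zip(state, dims_ranges):
            # Scale the coordinate so the range spans num_tiles tile widths.
            scaled = (x - low) / (high - low) * num_tiles
            coord = int(math.floor(scaled + offset))
            coord = min(max(coord, 0), num_tiles)  # clamp out-of-range states
            index = index * (num_tiles + 1) + coord
        # Each tiling owns its own block of indices.
        indices.append(tiling * tiles_per_tiling + index)
    return indices
```

Each state activates exactly `num_tilings` features, and nearby states share most of them, which is what lets a linear model generalize across similar states.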

end(reward)

Complete an episode.

Performs a final update of the linear model weights using the terminal reward and the last cached state-action pair.

Parameters

reward : float

Final reward received at the end of the episode.
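A terminal update of this kind can be sketched as follows, assuming one weight list per action and binary tile features. The helper name `terminal_update` and the weight layout are assumptions, not the library's internals. Because there is no next state, the TD target is the reward alone.

```python
def terminal_update(weights, prev_tiles, prev_action, reward, step_size):
    """Apply the final semi-gradient update of an episode; return the TD error.

    weights[a][i] is the (assumed) weight of tile i for action a.
    """
    q_prev = sum(weights[prev_action][t] for t in prev_tiles)
    delta = reward - q_prev  # no bootstrapped term at a terminal state
    for t in prev_tiles:
        # With binary features the gradient is 1 for each active tile.
        weights[prev_action][t] += step_size * delta
    return delta
```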

get_td_error(prev_tiles, prev_action, reward, active_tiles)

Compute the temporal-difference (TD) error.

Calculates the TD error as the reward plus the discounted maximum Q-value of the next state, minus the Q-value of the previous state-action pair.

Parameters

prev_tiles : array-like

Active tiles for the previous state.

prev_action : int

Action taken in the previous state.

reward : float

Reward received for the transition.

active_tiles : array-like

Active tiles for the current state.

Returns

float

The computed TD error.
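The computation can be sketched as below, assuming one weight list per action, so that a Q-value is the sum of the weights at the active tile indices. The function names, the weight layout, and passing `discount` as an argument are illustrative assumptions; the method above takes only the four documented arguments.

```python
def q_value(weights, tiles, action):
    """Q(s, a) for binary tile features: sum of the active tiles' weights."""
    return sum(weights[action][t] for t in tiles)


def td_error(weights, prev_tiles, prev_action, reward, active_tiles, discount):
    """Q-learning TD error: reward + discount * max_a Q(s', a) - Q(s, a)."""
    num_actions = len(weights)
    next_max = max(q_value(weights, active_tiles, a) for a in range(num_actions))
    return reward + discount * next_max - q_value(weights, prev_tiles, prev_action)
```

Taking the maximum over next-state actions, rather than the value of the action actually chosen next, is what makes the update off-policy Q-learning rather than SARSA.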

reset()

Reset the agent.

Resets the weights of the linear model to their initial values.

select_action(q_values)

Select an action using an epsilon-greedy policy.

Parameters

q_values : array-like

Estimated Q-values for all actions in the current state.

Returns

int

The action selected by epsilon-greedy exploration.
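For reference, epsilon-greedy selection can be sketched as follows. The function name and the `rng` argument are hypothetical, and random tie-breaking among equal Q-values is an assumption; the library may resolve ties differently.

```python
import random


def epsilon_greedy(q_values, epsilon, rng=random):
    """Pick a random action with probability epsilon, else a greedy one."""
    if rng.random() < epsilon:
        return rng.randrange(len(q_values))  # explore
    best = max(q_values)
    # Break ties at random so early, all-zero estimates do not favor action 0.
    ties = [a for a, q in enumerate(q_values) if q == best]
    return rng.choice(ties)  # exploit
```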

start(new_state)

Begin a new episode.

Extracts the active tiles for the initial state, computes Q-values, selects an action using the epsilon-greedy policy, and caches the state-action pair for future updates.

Parameters

new_state : array-like

The initial state observed from the environment.

Returns

int

The action selected by the agent.

step(reward, new_state)

Take a step in the environment.

Updates the linear model weights using the TD error from the previous transition, then selects the next action based on the new state.

Parameters

reward : float

Reward received from the previous action.

new_state : array-like

The new state observed from the environment.

Returns

int

The next action chosen by the agent.
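The `start`, `step`, and `end` methods are meant to be called in that order over one episode. The loop below sketches the calling pattern using a hypothetical stand-in agent and a toy three-step environment, so that it runs without the library. `RandomAgent`, `run_episode`, `toy_reset`, and `toy_step` are all illustrative names, not part of rlforge.

```python
import random


class RandomAgent:
    """Hypothetical stand-in exposing the same start/step/end methods."""

    def __init__(self, num_actions, seed=0):
        self.num_actions = num_actions
        self.rng = random.Random(seed)
        self.calls = []  # records the order in which methods are called

    def start(self, new_state):
        self.calls.append("start")
        return self.rng.randrange(self.num_actions)

    def step(self, reward, new_state):
        self.calls.append("step")
        return self.rng.randrange(self.num_actions)

    def end(self, reward):
        self.calls.append("end")


def toy_reset():
    """Toy environment: the state is a single counter starting at 0."""
    return [0.0]


def toy_step(state, action):
    """Advance the counter; the episode ends after three transitions."""
    new_state = [state[0] + 1.0]
    return new_state, -1.0, new_state[0] >= 3.0


def run_episode(agent, env_reset, env_step):
    """Drive one episode: start once, step per transition, end on termination."""
    state = env_reset()
    action = agent.start(state)
    total = 0.0
    while True:
        state, reward, done = env_step(state, action)
        total += reward
        if done:
            agent.end(reward)  # the terminal reward goes to end(), not step()
            return total
        action = agent.step(reward, state)
```

Under this pattern the terminal transition is reported through `end(reward)` only, which matches the description of `end` as performing the final update with the terminal reward.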