Softmax Actor Critic
- class rlforge.agents.policy_gradient.softmax_actor_critic.SoftmaxActorCriticAgent(actor_step_size, critic_step_size, avg_reward_step_size, num_actions, dims_ranges, temperature=1, iht_size=4096, num_tilings=8, num_tiles=8, wrap_dims=())
Softmax Actor-Critic Agent with average reward baseline.
This agent implements a vanilla actor-critic algorithm using tile coding for feature extraction and linear function approximation for both the actor (policy) and critic (value function). The actor updates are guided by the temporal-difference (TD) error, and actions are selected using a softmax distribution over action preferences.
Parameters
- actor_step_size : float
Learning rate for the actor updates.
- critic_step_size : float
Learning rate for the critic updates.
- avg_reward_step_size : float
Step size for updating the average reward baseline.
- num_actions : int
Number of discrete actions available in the environment.
- dims_ranges : list of tuple
Ranges for each state dimension, used by the tile coder.
- temperature : float, optional
Temperature parameter for softmax exploration (default=1).
- iht_size : int, optional
Size of the index hash table for tile coding (default=4096).
- num_tilings : int, optional
Number of tilings used in tile coding (default=8).
- num_tiles : int, optional
Number of tiles per dimension (default=8).
- wrap_dims : tuple, optional
Dimensions to wrap in tile coding (default=()).
- end(reward)
Complete an episode.
Performs a final update of the actor and critic using the terminal reward and the last cached state-action pair.
Parameters
- reward : float
Final reward received at the end of the episode.
- get_td_error(prev_tiles, reward, active_tiles, avg_reward)
Compute the temporal-difference (TD) error.
Calculates the TD error using the reward, average reward baseline, the critic’s prediction for the next state, and the critic’s prediction for the previous state.
Parameters
- prev_tiles : array-like
Active tiles for the previous state.
- reward : float
Reward received for the transition.
- active_tiles : array-like
Active tiles for the current state.
- avg_reward : float
Current estimate of the average reward baseline.
Returns
- float
The computed TD error.
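The description above matches the usual differential (average-reward) TD error, delta = reward - avg_reward + v(next) - v(prev). A minimal sketch follows, assuming a linear critic over binary tile features, so each value estimate is the sum of the critic weights at the active tile indices. The standalone function `td_error` and its explicit `critic_w` argument are hypothetical names for illustration, not the library's internals.

```python
import numpy as np

def td_error(critic_w, prev_tiles, reward, active_tiles, avg_reward):
    """Differential TD error for a linear critic over binary tile features."""
    # With binary features, v(s) is the sum of the weights at the active tiles.
    v_prev = critic_w[prev_tiles].sum()
    v_next = critic_w[active_tiles].sum()
    return reward - avg_reward + v_next - v_prev

# Toy weights (assumed values): v(prev) = 1.0, v(next) = 2.0
critic_w = np.zeros(16)
critic_w[[0, 1]] = 0.5
critic_w[[2, 3]] = 1.0

delta = td_error(critic_w, [0, 1], reward=1.0, active_tiles=[2, 3], avg_reward=0.25)
```

With these toy numbers the error is 1.0 - 0.25 + 2.0 - 1.0 = 1.75.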
- reset()
Reset the agent.
Resets the actor and critic weights to their initial values and clears the average reward baseline.
- select_action(q_values, temperature)
Select an action using a softmax distribution.
Parameters
- q_values : array-like
Action preferences or Q-values for the current state.
- temperature : float
Temperature parameter controlling exploration.
Returns
- int
The action selected by sampling from the softmax distribution.
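Softmax selection with a temperature can be sketched as below. This is an illustration under assumptions rather than the library's code: the helper names `softmax_probs` and `sample_action` are hypothetical, and the max-subtraction step is a common numerical-stability choice that the documentation does not confirm.

```python
import numpy as np

def softmax_probs(preferences, temperature=1.0):
    """Convert preferences into action probabilities."""
    prefs = np.asarray(preferences, dtype=float) / temperature
    prefs -= prefs.max()  # shift for numerical stability; probabilities are unchanged
    exp_prefs = np.exp(prefs)
    return exp_prefs / exp_prefs.sum()

def sample_action(preferences, temperature=1.0, rng=None):
    """Sample an action index from the softmax distribution."""
    rng = np.random.default_rng() if rng is None else rng
    probs = softmax_probs(preferences, temperature)
    return int(rng.choice(len(probs), p=probs))

probs = softmax_probs([1.0, 2.0, 3.0], temperature=1.0)
action = sample_action([1.0, 2.0, 3.0], temperature=0.5, rng=np.random.default_rng(0))
```

A lower temperature concentrates probability on the highest preference, while a higher temperature moves the distribution toward uniform.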
- start(new_state)
Begin a new episode.
Extracts active tiles for the initial state, computes action preferences, selects an action using softmax, and caches the state-action pair for future updates.
Parameters
- new_state : array-like
The initial state observed from the environment.
Returns
- int
The action selected by the agent.
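To illustrate how action preferences could be computed from active tiles, here is a minimal sketch assuming the actor keeps one weight vector per action, indexed by tile. The array layout `(num_actions, iht_size)`, the tile indices, and the helper `action_preferences` are assumptions for illustration; the library's internal storage may differ.

```python
import numpy as np

num_actions, iht_size = 3, 4096
actor_w = np.zeros((num_actions, iht_size))  # assumed layout: one row per action

def action_preferences(actor_w, active_tiles):
    """Sum each action's weights over the active tiles (linear in binary features)."""
    return actor_w[:, active_tiles].sum(axis=1)

# Hypothetical active tile indices, one per tiling in a real tile coder.
active_tiles = np.array([5, 130, 2047, 4000])
actor_w[1, active_tiles] = 0.25  # give action 1 a positive preference in this state

prefs = action_preferences(actor_w, active_tiles)
```

The resulting preference vector has one entry per action and can be passed to a softmax to obtain action probabilities.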
- step(reward, new_state)
Take a step in the environment.
Updates the actor and critic weights using the TD error and average reward baseline, then selects the next action.
Parameters
- reward : float
Reward received from the previous action.
- new_state : array-like
The new state observed from the environment.
Returns
- int
The next action chosen by the agent.
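One update of the kind described for step can be sketched as follows. This follows the textbook average-reward actor-critic with a softmax policy over linear preferences. The ordering of the three updates, the weight layout, and the function name `actor_critic_update` are assumptions, not confirmed details of this class.

```python
import numpy as np

def softmax(prefs, temperature=1.0):
    z = (prefs - prefs.max()) / temperature
    e = np.exp(z)
    return e / e.sum()

def actor_critic_update(actor_w, critic_w, avg_reward, prev_tiles, prev_action,
                        reward, active_tiles, actor_step_size, critic_step_size,
                        avg_reward_step_size, temperature=1.0):
    """Apply one in-place update; returns the new average reward and the TD error."""
    # Differential TD error from the critic's linear value estimates.
    delta = (reward - avg_reward
             + critic_w[active_tiles].sum() - critic_w[prev_tiles].sum())
    avg_reward += avg_reward_step_size * delta

    # Policy at the previous state, computed before the actor weights change.
    probs = softmax(actor_w[:, prev_tiles].sum(axis=1), temperature)

    # Critic: move the previous state's value estimate along the TD error.
    critic_w[prev_tiles] += critic_step_size * delta

    # Actor: gradient of log pi(a|s) w.r.t. preferences is (one_hot(a) - probs) / temperature.
    grad = -probs
    grad[prev_action] += 1.0
    actor_w[:, prev_tiles] += actor_step_size * delta * grad[:, None] / temperature
    return avg_reward, delta

actor_w = np.zeros((2, 8))
critic_w = np.zeros(8)
avg_reward, delta = actor_critic_update(
    actor_w, critic_w, avg_reward=0.0, prev_tiles=[0, 1], prev_action=0,
    reward=1.0, active_tiles=[2, 3],
    actor_step_size=0.1, critic_step_size=0.5, avg_reward_step_size=0.1)
```

With a positive TD error, the preference for the action just taken rises and the preferences for the other actions fall, only at the tiles that were active in the previous state.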