Deep Q-Network PyTorch (DQN)

class rlforge.agents.semi_gradient.dqn_pytorch.DQNTorchAgent(state_dim, action_dim, network_architecture=(64, 64), learning_rate=0.001, discount=0.99, temperature=1.0, target_network_update_steps=1000, num_replay=1, experience_buffer_size=100000, mini_batch_size=32, device='cpu')

Deep Q-Network (DQN) Agent implemented in PyTorch.

This agent uses a feedforward neural network to approximate Q-values for discrete actions. It supports both single-environment and vectorized-environment APIs, experience replay, and a target network for stable training.

Parameters

state_dim (int): Dimension of the input state space.
action_dim (int): Number of discrete actions available in the environment.
network_architecture (tuple of int, optional): Sizes of hidden layers in the Q-network (default=(64, 64)).
learning_rate (float, optional): Learning rate for the optimizer (default=1e-3).
discount (float, optional): Discount factor γ applied to future rewards (default=0.99).
temperature (float, optional): Temperature parameter for softmax exploration (default=1.0).
target_network_update_steps (int, optional): Number of training steps between target network synchronizations (default=1000).
num_replay (int, optional): Number of replay updates per environment step (default=1).
experience_buffer_size (int, optional): Maximum size of the replay buffer (default=100000).
mini_batch_size (int, optional): Size of mini-batches sampled from the replay buffer (default=32).
device (str or torch.device, optional): Device to run computations on, e.g. 'cpu' or 'cuda' (default='cpu').

end(reward)

Complete an episode in a single environment.

Stores the final transition into the replay buffer.

Parameters

reward (float): Final reward received at the end of the episode.

end_batch(rewards)

Handle the final reward and transition for a batch of terminated environments.

This method stores terminal transitions into the replay buffer and performs training updates if enough samples are available. It assumes that the agent’s internal caches for previous states and actions are valid for the terminated episodes.

Parameters

rewards (array-like, shape (N,)): Final rewards received for each of the N terminated environments.

Notes

Each terminal transition is stored as (S_t, A_t, R_t, S_{t+1}=S_t, Done=True).
Training is triggered after storing transitions if the replay buffer contains at least mini_batch_size samples.
The internal state/action cache is reset if all environments in the batch have terminated.
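
The placeholder next state in a terminal transition is harmless as long as the bootstrap term is masked by the done flag. The following minimal NumPy sketch shows the standard DQN target, y = r + γ(1 − done)·max_a Q_target(s′, a). It is an assumption about how such transitions are typically consumed, not a confirmed description of this agent's update rule (which could, for example, bootstrap from a softmax expectation rather than the maximum). The function name td_targets is hypothetical.

```python
import numpy as np

def td_targets(rewards, next_q_values, dones, discount=0.99):
    """Standard DQN targets: r + discount * (1 - done) * max_a Q_target(s', a)."""
    rewards = np.asarray(rewards, dtype=np.float64)
    dones = np.asarray(dones, dtype=np.float64)
    # Highest target-network Q-value in each next state, shape (N,)
    max_next_q = np.max(np.asarray(next_q_values, dtype=np.float64), axis=1)
    # The done flag zeroes the bootstrap term, so S_{t+1} is ignored when Done=True
    return rewards + discount * (1.0 - dones) * max_next_q

targets = td_targets(
    rewards=[1.0, 5.0],
    next_q_values=[[0.5, 2.0], [3.0, 4.0]],
    dones=[False, True],
)
```

In this example the first transition bootstraps from its next state, while the second (terminal) target reduces to the reward alone.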

load(filepath)

Load network weights from a file.

This method updates the main network with the saved weights and immediately synchronizes the target network to match.

Parameters

filepath (str): The path to the file containing the saved state dictionary.

Notes

The networks are set to evaluation mode during loading and then returned to their previous state.

reset()

Reset the agent’s internal state and networks.

Clears the replay buffer, resets counters, and rebuilds the main and target networks for a fresh start.

Notes

Resets elapsed_training_steps and total_steps to zero.
Clears cached previous state and action.
Calls reset_networks() to reinitialize the Q-networks and optimizer.

reset_networks()

Reset and rebuild networks and optimizer.

Creates a new main network and target network, initializes weights, and sets up the Adam optimizer.

save(filepath)

Save the agent’s main network weights to a file.

This method saves the state dictionary of the main Q-network so that the learned weights can be restored later with load().

Parameters

filepath (str): The path to the file where the weights should be saved (typically ending in .pth or .pt).

start(state)

Begin a new episode in a single environment.

Selects an initial action based on the current state.

Parameters

state (array-like): Initial state of the environment.

Returns

int: Selected action.

start_batch(states, deterministic=False)

Begin a batch of episodes.

Selects actions for multiple environments simultaneously.

Parameters

states (array-like, shape (N, state_dim)): Batch of initial states.
deterministic (bool, optional): If True, selects greedy actions; otherwise uses softmax exploration (default=False).

Returns

numpy.ndarray: Array of selected actions of shape (N,).
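
The two selection modes can be sketched as follows, assuming the Q-values are already available as an (N, action_dim) array. The helper select_actions and its rng argument are hypothetical; they only illustrate greedy selection versus softmax sampling, and the agent's internal code may differ.

```python
import numpy as np

def select_actions(q_values, temperature=1.0, deterministic=False, rng=None):
    """Pick one action per row of q_values, shape (N, action_dim)."""
    q_values = np.asarray(q_values, dtype=np.float64)
    if deterministic:
        # Greedy: index of the largest Q-value in each row
        return np.argmax(q_values, axis=1)
    rng = np.random.default_rng() if rng is None else rng
    # Softmax with max-subtraction for numerical stability
    scaled = q_values / temperature
    scaled = scaled - scaled.max(axis=1, keepdims=True)
    probs = np.exp(scaled)
    probs /= probs.sum(axis=1, keepdims=True)
    # Sample one action per environment from its own distribution
    return np.array([rng.choice(len(p), p=p) for p in probs])

q = [[1.0, 3.0, 2.0], [0.5, 0.1, 0.2]]
greedy = select_actions(q, deterministic=True)
sampled = select_actions(q, temperature=1.0, rng=np.random.default_rng(0))
```

Greedy selection is the natural choice for evaluation runs, while sampling keeps some probability on every action during training.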

step(reward, new_state, terminal=False)

Take a step in a single environment.

Updates replay buffer and selects the next action.

Parameters

reward (float): Reward received from the previous action.
new_state (array-like): Next state observed.
terminal (bool, optional): Whether the episode has terminated (default=False).

Returns

int: Selected action.
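
The call order of the single-environment methods (start, then step repeatedly, then end) can be illustrated with a small loop. The sketch below is self-contained: RandomAgent is a hypothetical stand-in that only mirrors the documented method names, and CountdownEnv is a toy environment invented for illustration. It leaves terminal at its default and hands the final reward to end(); how step(terminal=True) relates to end() is not specified here.

```python
import random

class RandomAgent:
    """Hypothetical stand-in exposing the same start/step/end methods."""
    def __init__(self, action_dim, seed=0):
        self.action_dim = action_dim
        self.rng = random.Random(seed)
        self.calls = []  # records the call order for inspection

    def start(self, state):
        self.calls.append("start")
        return self.rng.randrange(self.action_dim)

    def step(self, reward, new_state, terminal=False):
        self.calls.append("step")
        return self.rng.randrange(self.action_dim)

    def end(self, reward):
        self.calls.append("end")

class CountdownEnv:
    """Toy environment: the episode ends after a fixed number of steps."""
    def __init__(self, horizon=3):
        self.horizon = horizon
        self.t = 0

    def reset(self):
        self.t = 0
        return [0.0]

    def step(self, action):
        self.t += 1
        done = self.t >= self.horizon
        return [float(self.t)], 1.0, done  # next_state, reward, done

def run_episode(agent, env):
    state = env.reset()
    action = agent.start(state)      # first action of the episode
    total_reward = 0.0
    while True:
        state, reward, done = env.step(action)
        total_reward += reward
        if done:
            agent.end(reward)        # final transition, no action returned
            return total_reward
        action = agent.step(reward, state)

agent = RandomAgent(action_dim=2)
total = run_episode(agent, CountdownEnv(horizon=3))
```

Any object exposing the same three methods could be passed to run_episode in place of the stand-in.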

step_batch(rewards, next_states, dones, deterministic=False)

Take a step in multiple environments.

Stores transitions in the replay buffer, performs training updates, synchronizes the target network if needed, and selects next actions.

Parameters

rewards (array-like, shape (N,)): Rewards from the previous actions.
next_states (array-like, shape (N, state_dim)): Next states observed.
dones (array-like, shape (N,)): Boolean flags indicating episode termination.
deterministic (bool, optional): If True, selects greedy actions; otherwise uses softmax exploration (default=False).

Returns

numpy.ndarray: Array of selected actions of shape (N,).
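
How the constructor parameters mini_batch_size, num_replay, and target_network_update_steps might interact during batched stepping can be sketched with plain counters. This simulation rests on assumptions: it treats each replay update as one training step and synchronizes the target network whenever the training-step count is a multiple of target_network_update_steps. The function simulate_schedule is hypothetical and performs no learning.

```python
def simulate_schedule(env_steps, n_envs=1, mini_batch_size=32, num_replay=1,
                      target_network_update_steps=1000,
                      experience_buffer_size=100000):
    """Count gradient updates and target syncs under the assumed schedule."""
    buffer_len = 0
    training_steps = 0
    syncs = 0
    for _ in range(env_steps):
        # One transition per environment is stored on every batched step
        buffer_len = min(buffer_len + n_envs, experience_buffer_size)
        if buffer_len < mini_batch_size:
            continue  # not enough samples to form a mini-batch yet
        for _ in range(num_replay):
            training_steps += 1  # stands in for one gradient update
            if training_steps % target_network_update_steps == 0:
                syncs += 1       # stands in for copying main -> target weights
    return training_steps, syncs

updates, syncs = simulate_schedule(
    env_steps=100, n_envs=4, mini_batch_size=32,
    num_replay=2, target_network_update_steps=50,
)
```

With four environments the buffer reaches 32 samples on the eighth step, so 93 of the 100 steps trigger two updates each, giving 186 updates and three synchronizations under these assumptions.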

class rlforge.agents.semi_gradient.dqn_pytorch.ReplayBuffer(size, mini_batch_size)

A fixed-size replay buffer backed by collections.deque, which provides O(1) appends and pops at either end.

append(state, action, reward, terminal, new_state): Add a single new experience to the buffer.

clear(): Clears the buffer content.

sample(): Randomly sample a mini-batch of experiences from the buffer.

rlforge.agents.semi_gradient.dqn_pytorch.softmax(x, temperature=1.0)

Compute the softmax over the last dimension of an array.

This is typically used to convert Q-values into a probability distribution for exploration.

Parameters

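x (np.ndarray): Input Q-values, typically of shape (N, action_dim).
temperature (float, optional): Controls the entropy of the distribution. Higher temperatures flatten the distribution, making action selection more random; lower temperatures concentrate probability on the highest-valued actions (default=1.0).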

Returns

np.ndarray: Softmax probabilities with the same shape as x.
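
The effect of temperature can be seen in a small NumPy sketch. The function stable_softmax is a hypothetical re-implementation written for illustration; subtracting the row maximum before exponentiating is a common stabilization trick and is an assumption here, not a confirmed detail of the library function.

```python
import numpy as np

def stable_softmax(x, temperature=1.0):
    """Softmax over the last axis, with Q-values divided by temperature."""
    x = np.asarray(x, dtype=np.float64) / temperature
    # Subtracting the maximum leaves the result unchanged but avoids overflow
    x = x - np.max(x, axis=-1, keepdims=True)
    e = np.exp(x)
    return e / np.sum(e, axis=-1, keepdims=True)

q = np.array([[1.0, 2.0, 3.0]])
sharp = stable_softmax(q, temperature=0.1)    # close to one-hot on the best action
flat = stable_softmax(q, temperature=100.0)   # close to uniform
large = stable_softmax(np.array([[1000.0, 1001.0]]))  # stays finite
```

Low temperatures approach greedy selection and high temperatures approach uniform random selection, which is why the parameter acts as an exploration control.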