Deep Q-Network PyTorch (DQN)

class rlforge.agents.semi_gradient.dqn_pytorch.DQNTorchAgent(state_dim, action_dim, network_architecture=(64, 64), learning_rate=0.001, discount=0.99, temperature=1.0, target_network_update_steps=1000, num_replay=1, experience_buffer_size=100000, mini_batch_size=32, device='cpu')

Deep Q-Network (DQN) Agent implemented in PyTorch.

This agent uses a feedforward neural network to approximate Q-values for discrete actions. It supports both single-environment and vectorized-environment APIs, experience replay, and a target network for stable training.

Parameters

state_dim : int

Dimension of the input state space.

action_dim : int

Number of discrete actions available in the environment.

network_architecture : tuple of int, optional

Sizes of hidden layers in the Q-network (default=(64, 64)).

learning_rate : float, optional

Learning rate for the optimizer (default=1e-3).

discount : float, optional

Discount factor γ applied to future rewards (default=0.99).

temperature : float, optional

Temperature parameter for softmax exploration (default=1.0).

target_network_update_steps : int, optional

Number of training steps between target network synchronizations (default=1000).

num_replay : int, optional

Number of replay updates per environment step (default=1).

experience_buffer_size : int, optional

Maximum size of the replay buffer (default=100000).

mini_batch_size : int, optional

Size of mini-batches sampled from the replay buffer (default=32).

device : str or torch.device, optional

Device to run computations on, e.g. "cpu" or "cuda" (default="cpu").
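
How discount and the target network usually enter the update can be illustrated with a small NumPy sketch. It computes the standard DQN target, r + γ · max_a Q_target(s', a), with no bootstrapping at terminal transitions. This is an assumption about the general technique, not a copy of this class's internals (which may, for example, weight next-state values by the softmax policy instead of taking the maximum). The function td_targets and its argument names are hypothetical.

```python
import numpy as np

def td_targets(rewards, next_q_target, dones, discount=0.99):
    """Standard DQN targets: r + discount * max_a Q_target(s', a), zeroed at terminals.

    Hypothetical helper for illustration; not part of rlforge.
    """
    rewards = np.asarray(rewards, dtype=np.float64)
    dones = np.asarray(dones, dtype=np.float64)
    # Greedy value of each next state according to the target network's Q-values.
    next_values = np.max(np.asarray(next_q_target, dtype=np.float64), axis=1)
    # Terminal transitions (done=1) do not bootstrap from the next state.
    return rewards + discount * (1.0 - dones) * next_values

targets = td_targets(
    rewards=[1.0, 0.0],
    next_q_target=[[0.5, 2.0], [3.0, 1.0]],
    dones=[False, True],
)
```

The first target bootstraps from the best next-state value, while the second reduces to the reward alone because that transition is terminal.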

end(reward)

Complete an episode in a single environment.

Stores the final transition into the replay buffer.

Parameters

reward : float

Final reward received at the end of the episode.

end_batch(rewards)

Handle the final reward and transition for a batch of terminated environments.

This method stores terminal transitions into the replay buffer and performs training updates if enough samples are available. It assumes that the agent’s internal caches for previous states and actions are valid for the terminated episodes.

Parameters

rewards : array-like, shape (N,)

Final rewards received for each of the N terminated environments.

Notes

  • Each terminal transition is stored as (S_t, A_t, R_t, S_{t+1}=S_t, Done=True).

  • Training is triggered after storing transitions if the replay buffer contains at least mini_batch_size samples.

  • The internal state/action cache is reset if all environments in the batch have terminated.
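
The first two notes can be sketched in plain Python. The helper below is hypothetical and only mirrors the behavior described above: one terminal transition per environment, with the previous state reused as a placeholder next state, followed by a check of whether the buffer holds at least mini_batch_size samples. The tuple layout is an assumption taken from the note above.

```python
from collections import deque

def store_terminal_transitions(buffer, prev_states, prev_actions, rewards, mini_batch_size):
    """Append one terminal transition per environment; report whether training may run.

    Hypothetical helper for illustration; not part of rlforge.
    """
    for state, action, reward in zip(prev_states, prev_actions, rewards):
        # The next state is a placeholder (the previous state itself); because
        # done=True, it should never be used for bootstrapping.
        buffer.append((state, action, reward, state, True))
    # Training is only possible once a full mini-batch can be sampled.
    return len(buffer) >= mini_batch_size

buffer = deque(maxlen=100)
ready = store_terminal_transitions(
    buffer,
    prev_states=[[0.1, 0.2], [0.3, 0.4]],
    prev_actions=[1, 0],
    rewards=[1.0, -1.0],
    mini_batch_size=2,
)
```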

load(filepath)

Load network weights from a file.

This method updates the main network with the saved weights and immediately synchronizes the target network to match.

Parameters

filepath : str

The path to the file containing the saved state dictionary.

Notes

The networks are set to evaluation mode during loading and then returned to their previous state.

reset()

Reset the agent’s internal state and networks.

Clears the replay buffer, resets counters, and rebuilds the main and target networks for a fresh start.

Notes

  • Resets elapsed_training_steps and total_steps to zero.

  • Clears cached previous state and action.

  • Calls reset_networks() to reinitialize the Q-networks and optimizer.

reset_networks()

Reset and rebuild networks and optimizer.

Creates a new main network and target network, initializes weights, and sets up the Adam optimizer.

save(filepath)

Save the agent’s main network weights to a file.

This method saves the state dictionary of the main Q-network so that the agent's learned policy can be restored later with load().

Parameters

filepath : str

The path to the file where the weights should be saved (typically ending in .pth or .pt).

start(state)

Begin a new episode in a single environment.

Selects an initial action based on the current state.

Parameters

state : array-like

Initial state of the environment.

Returns

int

Selected action.

start_batch(states, deterministic=False)

Begin a batch of episodes.

Selects actions for multiple environments simultaneously.

Parameters

states : array-like, shape (N, state_dim)

Batch of initial states.

deterministic : bool, optional

If True, selects greedy actions; otherwise uses softmax exploration (default=False).

Returns

numpy.ndarray

Array of selected actions of shape (N,).
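
The difference between the two selection modes can be sketched in NumPy. This is a minimal illustration under stated assumptions, not the library's code: select_actions is a hypothetical function that takes precomputed Q-values of shape (N, action_dim), returns the argmax when deterministic is True, and otherwise samples from a temperature-scaled softmax.

```python
import numpy as np

def select_actions(q_values, temperature=1.0, deterministic=False, rng=None):
    """Pick one action per row of Q-values (hypothetical helper, not part of rlforge)."""
    q_values = np.asarray(q_values, dtype=np.float64)
    if deterministic:
        # Greedy: the highest-valued action in each row.
        return np.argmax(q_values, axis=1)
    rng = np.random.default_rng() if rng is None else rng
    # Temperature-scaled softmax, shifted by the row maximum for numerical stability.
    scaled = q_values / temperature
    scaled -= scaled.max(axis=1, keepdims=True)
    probs = np.exp(scaled)
    probs /= probs.sum(axis=1, keepdims=True)
    # Sample one action index per environment from its own distribution.
    return np.array([rng.choice(probs.shape[1], p=p) for p in probs])

q = [[1.0, 3.0, 2.0], [0.5, 0.1, 0.2]]
greedy = select_actions(q, deterministic=True)
sampled = select_actions(q, temperature=1.0, rng=np.random.default_rng(0))
```

Greedy selection is typically what you want for evaluation, while sampling keeps some probability on lower-valued actions during training.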

step(reward, new_state, terminal=False)

Take a step in a single environment.

Updates the replay buffer and selects the next action.

Parameters

reward : float

Reward received from the previous action.

new_state : array-like

Next state observed.

terminal : bool, optional

Whether the episode has terminated (default=False).

Returns

int

Selected action.
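
The single-environment methods are meant to be called in the order start, step (repeatedly), end. The sketch below shows that call order with a stand-in agent and a toy episode so that it runs without rlforge installed. RandomAgent and run_episode are hypothetical, and calling end(reward) for the last transition (rather than step with terminal=True) is an assumption about intended usage.

```python
import random

class RandomAgent:
    """Hypothetical stand-in exposing the same start/step/end interface."""

    def __init__(self, action_dim, seed=0):
        self.action_dim = action_dim
        self.rng = random.Random(seed)
        self.transitions = 0  # counts transitions the agent has been shown

    def start(self, state):
        return self.rng.randrange(self.action_dim)

    def step(self, reward, new_state, terminal=False):
        self.transitions += 1
        return self.rng.randrange(self.action_dim)

    def end(self, reward):
        self.transitions += 1

def run_episode(agent, episode_length=5):
    """Toy episode: the state is a step counter and every step yields reward 1.0."""
    state = [0.0]
    action = agent.start(state)  # first action of the episode
    total_reward = 0.0
    for t in range(1, episode_length + 1):
        # A real environment would consume `action` here.
        reward, state = 1.0, [float(t)]
        total_reward += reward
        if t == episode_length:
            agent.end(reward)  # final transition, no next action needed
        else:
            action = agent.step(reward, state)
    return total_reward

agent = RandomAgent(action_dim=2)
episode_return = run_episode(agent)
```

A DQNTorchAgent constructed with matching state_dim and action_dim would presumably slot into run_episode in place of RandomAgent.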

step_batch(rewards, next_states, dones, deterministic=False)

Take a step in multiple environments.

Stores transitions in the replay buffer, performs training updates, synchronizes the target network if needed, and selects next actions.

Parameters

rewards : array-like, shape (N,)

Rewards from the previous actions.

next_states : array-like, shape (N, state_dim)

Next states observed.

dones : array-like, shape (N,)

Boolean flags indicating episode termination.

deterministic : bool, optional

If True, selects greedy actions; otherwise uses softmax exploration (default=False).

Returns

numpy.ndarray

Array of selected actions of shape (N,).
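
The interplay of mini_batch_size, num_replay and target_network_update_steps during step_batch can be sketched as pure bookkeeping, with no networks involved. UpdateSchedule is a hypothetical class; the ordering (store, then train num_replay times once a mini-batch is available, then sync the target every target_network_update_steps training steps) is an assumption based on the description above, not a confirmed reading of the implementation.

```python
from collections import deque

class UpdateSchedule:
    """Hypothetical bookkeeping for replay updates and target syncs (no networks)."""

    def __init__(self, mini_batch_size=4, num_replay=1, target_network_update_steps=3):
        self.buffer = deque(maxlen=1000)
        self.mini_batch_size = mini_batch_size
        self.num_replay = num_replay
        self.target_network_update_steps = target_network_update_steps
        self.elapsed_training_steps = 0
        self.target_syncs = 0

    def on_step_batch(self, transitions):
        # 1. Store one transition per environment.
        self.buffer.extend(transitions)
        # 2. Skip training until a full mini-batch can be sampled.
        if len(self.buffer) < self.mini_batch_size:
            return
        for _ in range(self.num_replay):
            # 3. One gradient update on a sampled mini-batch would happen here.
            self.elapsed_training_steps += 1
            # 4. Periodically copy the main network's weights into the target network.
            if self.elapsed_training_steps % self.target_network_update_steps == 0:
                self.target_syncs += 1

schedule = UpdateSchedule()
for _ in range(8):
    # Two environments, so two (placeholder) transitions per call.
    schedule.on_step_batch([("s", 0, 0.0, "s_next", False)] * 2)
```

With these settings the first call only fills the buffer, each of the remaining seven calls performs one update, and the target is synchronized after the third and sixth updates.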

class rlforge.agents.semi_gradient.dqn_pytorch.ReplayBuffer(size, mini_batch_size)

An optimized fixed-size replay buffer using collections.deque for O(1) appends and pops.

append(state, action, reward, terminal, new_state)

Add a single new experience to the buffer.

clear()

Clear the buffer contents.

sample()

Randomly sample a mini-batch of experiences from the buffer.
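
A deque-backed buffer with the same method names can be sketched in a few lines. SimpleReplayBuffer is a hypothetical stand-in, not the library's class: the tuple layout follows the append signature above, while the seed argument and uniform sampling without replacement are assumptions.

```python
import random
from collections import deque

class SimpleReplayBuffer:
    """Hypothetical fixed-size replay buffer mirroring append/sample/clear."""

    def __init__(self, size, mini_batch_size, seed=None):
        # deque(maxlen=...) silently drops the oldest item when full.
        self.buffer = deque(maxlen=size)
        self.mini_batch_size = mini_batch_size
        self.rng = random.Random(seed)

    def append(self, state, action, reward, terminal, new_state):
        self.buffer.append((state, action, reward, terminal, new_state))

    def sample(self):
        # Uniform sampling without replacement; copy to a list for fast indexing.
        return self.rng.sample(list(self.buffer), self.mini_batch_size)

    def clear(self):
        self.buffer.clear()

    def __len__(self):
        return len(self.buffer)

replay = SimpleReplayBuffer(size=3, mini_batch_size=2, seed=0)
for i in range(5):
    replay.append([float(i)], 0, 1.0, False, [float(i + 1)])
batch = replay.sample()
```

Because the capacity is 3, only the three most recent of the five appended experiences remain available for sampling.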

rlforge.agents.semi_gradient.dqn_pytorch.softmax(x, temperature=1.0)

Compute the softmax over the last dimension of an array.

This is typically used to convert Q-values into a probability distribution for exploration.

Parameters

x : np.ndarray

Input Q-values, typically of shape (N, action_dim).

temperature : float, optional

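Controls the entropy of the distribution (default=1.0). A higher temperature gives a flatter distribution and therefore more random actions; a lower temperature concentrates probability on the highest-valued actions.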

Returns

np.ndarray

Softmax probabilities with the same shape as x.
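
The effect of temperature is easiest to see numerically. The following is a minimal NumPy sketch of a temperature-scaled softmax over the last dimension; softmax_sketch is a hypothetical name, and subtracting the maximum for numerical stability is an assumption about good practice rather than a statement about the library's implementation.

```python
import numpy as np

def softmax_sketch(x, temperature=1.0):
    """Temperature-scaled softmax over the last axis (hypothetical, for illustration)."""
    x = np.asarray(x, dtype=np.float64) / temperature
    # Subtracting the maximum leaves the result unchanged but avoids overflow in exp.
    x = x - np.max(x, axis=-1, keepdims=True)
    exp_x = np.exp(x)
    return exp_x / np.sum(exp_x, axis=-1, keepdims=True)

q_values = np.array([[1.0, 2.0, 3.0]])
sharp = softmax_sketch(q_values, temperature=0.1)    # nearly greedy
flat = softmax_sketch(q_values, temperature=100.0)   # nearly uniform
```

A low temperature puts almost all probability on the best action, while a high temperature approaches a uniform distribution over actions.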