
DPO

Documentation for the Direct Preference Optimization (DPO) Module

Overview

Direct Preference Optimization (DPO) is a PyTorch module for optimizing policies in decision-making models directly from preference data. It uses a reference model together with the trainable policy model to compute loss values from pairs of preferred and unpreferred sequences, and these losses guide the learning process.
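
Conceptually, the loss follows the standard DPO objective: the policy is rewarded when its log-probability ratio against the reference model is larger on the preferred sequence than on the unpreferred one, with beta scaling that margin. The sketch below illustrates this objective with placeholder per-sequence log probabilities; it is an assumption about the formulation, not the module's literal implementation.

import torch
import torch.nn.functional as F

def dpo_objective(policy_pref_logp, policy_unpref_logp,
                  ref_pref_logp, ref_unpref_logp, beta=0.1):
    # Log-probability ratios of the policy relative to the reference model
    pref_ratio = policy_pref_logp - ref_pref_logp
    unpref_ratio = policy_unpref_logp - ref_unpref_logp
    # A larger margin between preferred and unpreferred ratios lowers the loss
    return -F.logsigmoid(beta * (pref_ratio - unpref_ratio)).mean()

# Illustrative call with random per-sequence log probabilities for 4 pairs
print(dpo_objective(torch.randn(4), torch.randn(4), torch.randn(4), torch.randn(4)))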

Class Definition

class DPO(nn.Module):
    def __init__(self, model: nn.Module, *, beta: float = 0.1): ...

Arguments

| Argument | Type      | Description                                                                          | Default |
|----------|-----------|--------------------------------------------------------------------------------------|---------|
| model    | nn.Module | The policy model to be optimized.                                                    | -       |
| beta     | float     | Scaling factor controlling the influence of the log-probability ratios on the loss. | 0.1     |
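
Note that only the policy model is passed to the constructor. The reference model mentioned in the overview is typically derived internally, for example as a frozen copy of the policy taken at construction time. The snippet below sketches that common pattern as an assumption; it is not a confirmed detail of this implementation.

import copy
from torch import nn

def make_reference_model(policy_model: nn.Module) -> nn.Module:
    # Snapshot the policy and freeze it so it never receives gradient updates
    ref_model = copy.deepcopy(policy_model)
    for param in ref_model.parameters():
        param.requires_grad_(False)
    return ref_model.eval()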

Methods

forward(preferred_seq: Tensor, unpreferred_seq: Tensor) -> Tensor

Computes the loss from the difference in log probabilities that the policy and reference models assign to the preferred and unpreferred sequences.

Arguments
| Argument        | Type   | Description                                    |
|-----------------|--------|------------------------------------------------|
| preferred_seq   | Tensor | The preferred sequence of actions/decisions.   |
| unpreferred_seq | Tensor | The unpreferred sequence of actions/decisions. |
Returns

A torch.Tensor representing the computed loss.

Usage Examples

Example 1: Basic Setup and Usage

import torch
from torch import nn

from zeta.rl import DPO


# Define a simple policy model
class PolicyModel(nn.Module):
    def __init__(self, input_dim, output_dim):
        super().__init__()
        self.fc = nn.Linear(input_dim, output_dim)

    def forward(self, x):
        return self.fc(x)


input_dim = 10
output_dim = 5
policy_model = PolicyModel(input_dim, output_dim)

# Initialize DPO with the policy model
dpo_model = DPO(model=policy_model, beta=0.1)

# Sample preferred and unpreferred sequences
preferred_seq = torch.randn(1, 10, 10)
unpreferred_seq = torch.randn(1, 10, 10)

# Compute loss
loss = dpo_model(preferred_seq, unpreferred_seq)
print(loss)

Example 2: Integrating with an Optimizer

optimizer = torch.optim.Adam(dpo_model.parameters(), lr=0.001)

# Training loop
for epoch in range(100):
    optimizer.zero_grad()
    loss = dpo_model(preferred_seq, unpreferred_seq)
    loss.backward()
    optimizer.step()
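
In practice you may want to monitor the loss and bound gradient magnitudes during training. The variant of the loop below adds optional logging and gradient clipping; these are common additions, not requirements of the DPO module.

for epoch in range(100):
    optimizer.zero_grad()
    loss = dpo_model(preferred_seq, unpreferred_seq)
    loss.backward()
    # Optional: keep gradient norms bounded for more stable updates
    torch.nn.utils.clip_grad_norm_(dpo_model.parameters(), max_norm=1.0)
    optimizer.step()
    if epoch % 10 == 0:
        print(f"epoch {epoch}: loss = {loss.item():.4f}")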

Notes

  • Ensure that preferred_seq and unpreferred_seq have the same shape and are compatible with the input dimensions of the policy model (a quick sanity check is sketched after this list).
  • beta is a hyperparameter and may require tuning for different applications.
  • The policy model should be structured to output logits compatible with the sequences being evaluated.
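
As referenced in the first note, a small sanity check can catch shape mismatches before training. The assertions below are illustrative and reuse the names from Example 1.

# Both sequences should share a shape that the policy model accepts
assert preferred_seq.shape == unpreferred_seq.shape, "sequence shapes must match"
_ = policy_model(preferred_seq)  # raises if the last dimension != input_dim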

This documentation describes the DPO module's interface for preference-based policy optimization. The examples demonstrate basic usage and integration within a training loop.