SophiaG Optimizer for Zeta Library

Overview

The SophiaG optimizer adaptively adjusts learning rates during training, combining momentum-based acceleration with second-order, Hessian-based preconditioning. It is particularly useful for training deep neural networks and for optimizing complex, non-convex loss functions. Key features include:

  1. Momentum: Utilizes exponentially moving averages of gradients.
  2. Adaptive Learning Rate: Adjusts the learning rate based on the second-order Hessian information.
  3. Regularization: Applies weight decay to avoid overfitting.
  4. Optional Settings: Supports maximizing the loss function and exposes capturable and dynamic flags.

Class Definition

class SophiaG(Optimizer):
    def __init__(self, params, lr=1e-4, betas=(0.965, 0.99), rho=0.04,
                 weight_decay=1e-1, *, maximize: bool = False,
                 capturable: bool = False, dynamic: bool = False):

Parameters:

  • params (iterable): Iterable of parameters to optimize.
  • lr (float, default=1e-4): Learning rate.
  • betas (Tuple[float, float], default=(0.965, 0.99)): Coefficients used for computing running averages of gradient and Hessian.
  • rho (float, default=0.04): Damping factor for Hessian-based updates.
  • weight_decay (float, default=1e-1): Weight decay factor.
  • maximize (bool, default=False): Whether to maximize the loss function.
  • capturable (bool, default=False): Enable/Disable special capturing features.
  • dynamic (bool, default=False): Enable/Disable dynamic adjustments of the optimizer.
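
For reference, the sketch below constructs the optimizer with every argument spelled out. The values are simply the defaults from the signature above, and model stands in for any torch.nn.Module you are training.

import torch.nn as nn

from zeta import SophiaG

model = nn.Linear(10, 1)  # placeholder model

optimizer = SophiaG(
    model.parameters(),
    lr=1e-4,
    betas=(0.965, 0.99),
    rho=0.04,
    weight_decay=1e-1,
    maximize=False,
    capturable=False,
    dynamic=False,
)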

Usage and Functionality

1. Initialization

Upon initialization, the optimizer validates its hyperparameters and stores them as the defaults for its parameter groups.

from zeta import SophiaG

optimizer = SophiaG(model.parameters(), lr=0.01, betas=(0.9, 0.999), weight_decay=1e-4)

2. Step Forward

The .step() method updates the model parameters. It is decorated with @torch.no_grad() so that the update itself does not build a computation graph or track gradients.

optimizer.zero_grad()  # clear gradients from the previous iteration
loss = criterion(output, target)
loss.backward()  # compute gradients
optimizer.step()  # apply the SophiaG update

3. Update Hessian and Exponential Average

The optimizer uses internal methods to update the Hessian estimate and the exponential moving average (EMA) of the gradients; both running averages are controlled by the betas coefficients.
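
As a rough sketch (not the library's exact code), both quantities follow the usual EMA recurrences: beta1 smooths the gradient and beta2 smooths a per-parameter diagonal Hessian estimate. The grad * grad estimate below is only one common choice and is an assumption for illustration.

import torch

beta1, beta2 = 0.965, 0.99

# In the optimizer these buffers live in the per-parameter state.
exp_avg = torch.zeros(10)  # EMA of gradients
hessian = torch.zeros(10)  # EMA of the diagonal Hessian estimate

grad = torch.randn(10)  # current gradient
hess_estimate = grad * grad  # illustrative diagonal Hessian estimate

exp_avg.mul_(beta1).add_(grad, alpha=1 - beta1)
hessian.mul_(beta2).add_(hess_estimate, alpha=1 - beta2)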

4. SophiaG Function

The core SophiaG function updates the parameters based on the gradient (grad), moving average (exp_avg), and Hessian (hessian). It uses the following update formula:

\[
\text{param} \leftarrow \text{param} - \text{lr} \times \left( \beta_1 \times \text{exp\_avg} + \frac{(1 - \beta_1) \times \text{grad}}{\left( \beta_2 \times \text{hessian} + (1 - \beta_2) \right)^{\rho}} \right)
\]
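
Written as a per-parameter tensor update, the formula above corresponds to roughly the following sketch (an illustration of the formula, not the library's exact implementation):

import torch


def sophiag_update(param, grad, exp_avg, hessian,
                   lr=1e-4, beta1=0.965, beta2=0.99, rho=0.04):
    # Preconditioner built from the Hessian EMA, as in the formula above.
    denom = (beta2 * hessian + (1 - beta2)).pow(rho)
    step = beta1 * exp_avg + (1 - beta1) * grad / denom
    param.add_(step, alpha=-lr)  # in-place descent step


# Toy tensors just to exercise the update.
p = torch.zeros(4)
sophiag_update(p, grad=torch.randn(4), exp_avg=torch.zeros(4), hessian=torch.ones(4))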

Usage Examples

1. Basic Usage:

import torch
import torch.nn as nn

from zeta import SophiaG

model = nn.Linear(10, 1)
optimizer = SophiaG(model.parameters(), lr=0.01)
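
Continuing from the snippet above, a minimal training step could look like the following; the input and target tensors are placeholders.

criterion = nn.MSELoss()
inputs = torch.randn(8, 10)  # placeholder batch
targets = torch.randn(8, 1)

optimizer.zero_grad()
loss = criterion(model(inputs), targets)
loss.backward()
optimizer.step()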

2. Customizing Betas and Learning Rate:

import torch

from zeta import SophiaG

# Assumes `model` is defined as in the basic usage example above.
optimizer = SophiaG(model.parameters(), lr=0.001, betas=(0.9, 0.999))

3. Using with Weight Decay:

from zeta import SophiaG

# Assumes `model` is defined as in the basic usage example above.
optimizer = SophiaG(model.parameters(), lr=0.01, weight_decay=1e-4)

Additional Information and Tips

  • Make sure the params iterable you pass comes from the model you intend to train (for example, model.parameters()).
  • To maximize the loss function (useful in adversarial training), set maximize=True, as shown below.
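
For example (with model assumed to be defined as in the earlier examples):

from zeta import SophiaG

optimizer = SophiaG(model.parameters(), lr=1e-3, maximize=True)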

Common Issues

  • SophiaG does not support sparse gradients; if your model produces them, use an optimizer with sparse-gradient support instead.

References and Resources

For further questions or issues, visit our GitHub repository.