# AlibiPositionalBias Documentation
## Introduction
The `AlibiPositionalBias` module belongs to the `zeta` library and plays a crucial role in handling positional bias for multi-head attention mechanisms. Specifically, it replaces absolute positional embeddings with a linear bias added to the attention scores, using per-head slopes derived from the number of attention heads.
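To situate the module, the sketch below shows where such a bias enters a standard attention computation. The shapes and the zero placeholder are illustrative assumptions, not the library's internals.

```python
import torch

heads, i, j = 4, 10, 10
logits = torch.randn(heads, i, j)  # toy attention logits per head

# ALiBi-style biases are added to the logits before the softmax,
# in place of absolute positional embeddings. In practice `bias`
# would come from AlibiPositionalBias(heads, total_heads)(i, j);
# a zero tensor stands in for it here.
bias = torch.zeros(heads, i, j)
attn = torch.softmax(logits + bias, dim=-1)
```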
## Class Definition
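A minimal sketch of the class interface, reconstructed from the parameters documented below; the actual implementation carries additional state (slopes, cached bias) and may differ in detail.

```python
from torch import nn

class AlibiPositionalBias(nn.Module):
    """Per-head linear positional bias for attention (interface sketch)."""

    def __init__(self, heads: int, total_heads: int, **kwargs) -> None:
        super().__init__()
        self.heads = heads              # heads that receive an ALiBi slope
        self.total_heads = total_heads  # total heads in the network
```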
### Parameters

- `heads` (`int`): Number of attention heads for which the slopes need to be calculated.
- `total_heads` (`int`): Total number of attention heads in the network.
### Attributes
- `slopes` (`Tensor`): Tensor containing the slope values, computed from the number of heads.
- `bias` (`Tensor` or `None`): Tensor caching the positional bias values; it is `None` before the first computation and whenever the bias needs to be recomputed.
### Methods
#### `__init__(self, heads, total_heads, **kwargs) -> None`

Initializes the `AlibiPositionalBias` module.
#### `get_bias(self, i, j, device) -> Tensor`

Computes the positional bias for given dimensions `i` and `j` (a sketch follows the parameter list).
**Parameters:**

- `i` (`int`): First dimension of the required positional bias.
- `j` (`int`): Second dimension of the required positional bias.
- `device` (`torch.device`): The device on which computations are to be performed.
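One plausible realization of `get_bias`, assuming query positions occupy the last `i` of the `j` key positions, so the bias is the negated absolute query/key offset (the per-head slope scaling happens separately):

```python
import torch

def get_bias_sketch(i: int, j: int, device: torch.device) -> torch.Tensor:
    j_range = torch.arange(j, device=device)         # key positions 0 .. j-1
    i_range = torch.arange(j - i, j, device=device)  # last i query positions
    # Negated absolute distance for every query/key pair, shape (1, i, j)
    return -torch.abs(j_range[None, None, :] - i_range[None, :, None])
```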
#### `_get_slopes(heads) -> List[float]`

A static method that calculates slopes based on the number of attention heads (see the sketch after the parameter list).
**Parameters:**

- `heads` (`int`): Number of attention heads.
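The published ALiBi recipe uses a geometric slope schedule starting at $2^{-8/n}$ for $n$ heads, with interpolation when $n$ is not a power of two. The sketch below follows that recipe and is an assumption about, not a copy of, the library's implementation.

```python
import math
from typing import List

def get_slopes_sketch(heads: int) -> List[float]:
    def power_of_2_slopes(n: int) -> List[float]:
        start = 2 ** (-(2 ** -(math.log2(n) - 3)))   # equals 2 ** (-8 / n)
        return [start * start**k for k in range(n)]  # geometric sequence

    if math.log2(heads).is_integer():
        return power_of_2_slopes(heads)

    # For non-powers of two, keep the slopes of the nearest lower power
    # of two and interleave extras from the doubled schedule.
    closest = 2 ** math.floor(math.log2(heads))
    extras = power_of_2_slopes(2 * closest)[0::2][: heads - closest]
    return power_of_2_slopes(closest) + extras
```

For example, `get_slopes_sketch(8)` yields `[1/2, 1/4, 1/8, ..., 1/256]`.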
#### `forward(self, i, j) -> Tensor`

Computes or retrieves the bias tensor for the given dimensions.
**Parameters:**

- `i` (`int`): First dimension of the required positional bias.
- `j` (`int`): Second dimension of the required positional bias.
## Mathematical Formula
Given $n$ attention heads, the ALiBi positional bias can be represented as:

$$\text{Bias} = -\lvert j_{\text{range}} \rvert \times \text{slope}$$
Where:

- $j_{\text{range}}$ is an array of integers from $0$ to $j - 1$.
- $\text{slope}$ is computed from the number of heads by the `_get_slopes` method.
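A small worked instance (numbers chosen for illustration): with $j = 4$ and a slope of $0.25$, $j_{\text{range}} = [0, 1, 2, 3]$, so the bias row is $-[0, 1, 2, 3] \times 0.25 = [0, -0.25, -0.5, -0.75]$; attention to more distant positions is penalized linearly.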
## Usage Examples

### Example 1: Initialize and compute bias
```python
import torch
from zeta import AlibiPositionalBias

# Apply ALiBi slopes to 4 of the 8 total attention heads
bias_module = AlibiPositionalBias(heads=4, total_heads=8)

# Compute the bias for a 10 x 10 attention map
bias = bias_module(10, 10)
print(bias)
```
### Example 2: Retrieve stored bias
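A sketch of retrieving the stored bias, assuming (per the Attributes section) that the module caches the last computed bias in its `bias` attribute after a forward pass:

```python
from zeta import AlibiPositionalBias

bias_module = AlibiPositionalBias(heads=4, total_heads=8)
bias_module(10, 10)        # first call computes and caches the bias

stored = bias_module.bias  # cached tensor, assumed reused by later calls
print(stored)
```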
### Example 3: Computing bias for different dimensions
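A sketch of requesting biases for several shapes; whether a smaller request is served from the cache or recomputed is implementation-dependent.

```python
from zeta import AlibiPositionalBias

bias_module = AlibiPositionalBias(heads=4, total_heads=8)

small = bias_module(5, 5)    # bias for a 5 x 5 attention map
large = bias_module(16, 16)  # larger dimensions force a fresh computation
print(small.shape, large.shape)
```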
## Note

- It's crucial to ensure that the `total_heads` parameter is always greater than or equal to the `heads` parameter during initialization.
- The `device` property is determined internally from the module's registered buffers and controls where computations are performed.
## References

For a deeper understanding and applications of positional bias in attention mechanisms, one may refer to the foundational paper on Transformer architectures:

- [Attention Is All You Need](https://arxiv.org/abs/1706.03762)

The `einops` library also provides a versatile interface for tensor manipulations; more details can be found in its official documentation.