The FeedForward module is a feedforward neural network with LayerNorms and activation functions, designed for various transformer-based models. It offers flexibility in terms of the activation functions used, allowing you to choose between GELU, SiLU, or ReLU squared. Additionally, it supports the Gated Linear Unit (GLU) activation and LayerNorm (LN) after the activation layer for advanced configurations.
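
As a rough illustration of the activation choices named above, the following sketch evaluates them with plain PyTorch. It is not the module's own code: the helper names relu_squared and glu_gate are hypothetical, and the gating shown (project to twice the width, split, then multiply one half by the activated other half) is the common GLU formulation, assumed here rather than taken from the FeedForward source.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


def relu_squared(x: torch.Tensor) -> torch.Tensor:
    """ReLU squared: zero for negative inputs, x**2 for positive inputs."""
    return F.relu(x) ** 2


def glu_gate(x: torch.Tensor, proj: nn.Linear, act=F.gelu) -> torch.Tensor:
    """Common GLU pattern (assumed): split a doubled projection into a
    value half and a gate half, then scale the value by act(gate)."""
    value, gate = proj(x).chunk(2, dim=-1)
    return value * act(gate)


x = torch.tensor([-2.0, -0.5, 0.0, 0.5, 2.0])
gelu_out = F.gelu(x)          # smooth; slightly negative for negative inputs
silu_out = F.silu(x)          # x * sigmoid(x), also known as swish
relu2_out = relu_squared(x)   # values: 0, 0, 0, 0.25, 4

proj = nn.Linear(8, 2 * 16)   # doubled width so the output can be split in two
glu_out = glu_gate(torch.randn(1, 8), proj)   # shape (1, 16)
```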

Class Definition

class FeedForward(nn.Module):
    """
    Feedforward neural network with LayerNorms and GELU activations.

    Args:
        dim (int): Input dimension.
        dim_out (int, optional): Output dimension. Defaults to None (same as input dimension).
        mult (int, optional): Multiplier for the hidden dimension. Defaults to 4.
        glu (bool, optional): Whether to use the Gated Linear Unit (GLU) activation. Defaults to False.
        glu_mult_bias (bool, optional): Whether to use a bias term with the GLU activation. Defaults to False.
        swish (bool, optional): Whether to use the SiLU activation. Defaults to False.
        relu_squared (bool, optional): Whether to use the ReLU squared activation. Defaults to False.
        post_act_ln (bool, optional): Whether to apply LayerNorm after activation. Defaults to False.
        dropout (float, optional): Dropout probability. Defaults to 0.0.
        no_bias (bool, optional): Whether to omit bias terms in linear layers. Defaults to False.
        zero_init_output (bool, optional): Whether to initialize the output linear layer to zero. Defaults to False.

    Usage:
    >>> model = FeedForward(768, 2048, dropout=0.1)
    >>> x = torch.randn(1, 768)
    >>> model(x).shape
    torch.Size([1, 2048])
    """


| Parameter Name   | Description                                            | Default Value | Type  |
|------------------|--------------------------------------------------------|---------------|-------|
| dim              | Input dimension                                        | -             | int   |
| dim_out          | Output dimension (optional)                            | None          | int   |
| mult             | Multiplier for hidden dimension                        | 4             | int   |
| glu              | Whether to use GLU activation                          | False         | bool  |
| glu_mult_bias    | Whether to use bias term with GLU activation           | False         | bool  |
| swish            | Whether to use SiLU activation                         | False         | bool  |
| relu_squared     | Whether to use ReLU squared activation                 | False         | bool  |
| post_act_ln      | Whether to apply LayerNorm after activation            | False         | bool  |
| dropout          | Dropout probability                                    | 0.0           | float |
| no_bias          | Whether to omit bias terms in linear layers            | False         | bool  |
| zero_init_output | Whether to initialize the output linear layer to zero  | False         | bool  |
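
To make the interaction between dim, mult, and dim_out concrete, here is a small sketch of the width arithmetic. It assumes the hidden width is computed as int(dim * mult), which is an assumption about the implementation, and that dim_out falls back to dim when it is None, which follows from the parameter descriptions above. The helper ff_widths is hypothetical and exists only for this illustration.

```python
from typing import Optional, Tuple


def ff_widths(dim: int, dim_out: Optional[int] = None, mult: int = 4) -> Tuple[int, int, int]:
    """Return (input, hidden, output) widths under the assumptions above."""
    inner_dim = int(dim * mult)                     # assumed hidden width
    out_dim = dim if dim_out is None else dim_out   # None means "same as input"
    return dim, inner_dim, out_dim


print(ff_widths(768))                # (768, 3072, 768)
print(ff_widths(512, 1024))          # (512, 2048, 1024)
print(ff_widths(256, 512, mult=2))   # (256, 512, 512)
```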

Usage Examples

Example 1: Basic FeedForward Layer

# dropout is passed by keyword; a third positional argument would be read as mult
model = FeedForward(768, 2048, dropout=0.1)
x = torch.randn(1, 768)
output = model(x)

Example 2: Using SiLU Activation

model = FeedForward(512, 1024, swish=True)
x = torch.randn(1, 512)
output = model(x)

Example 3: Advanced Configuration with GLU Activation and LayerNorm

model = FeedForward(256, 512, glu=True, post_act_ln=True, dropout=0.2)
x = torch.randn(1, 256)
output = model(x)
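
Example 3 enables dropout, which in PyTorch is active only in training mode, so the same input can give different outputs under model.train() and model.eval(). The sketch below shows that behaviour with a bare nn.Dropout layer rather than FeedForward itself, so it runs with PyTorch alone; the tensor size is arbitrary.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
drop = nn.Dropout(p=0.2)
x = torch.ones(1, 1000)

drop.train()                # training mode: elements are zeroed at random
train_out = drop(x)
kept = train_out != 0       # mask of elements that survived
# Survivors are rescaled by 1 / (1 - p) = 1.25 to preserve the expected value.

drop.eval()                 # evaluation mode: dropout acts as the identity
eval_out = drop(x)
```

Calling model.eval() before inference therefore matters whenever dropout is greater than 0.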


The FeedForward module performs a feedforward operation on the input tensor x. It consists of a multi-layer perceptron (MLP) with a configurable activation function and optional LayerNorm. The exact configuration depends on the parameters provided during initialization.

The key steps of the forward pass are:

  1. The input tensor x is projected to an inner dimension.
  2. The specified activation function (e.g., GELU, SiLU, or ReLU squared) is applied.
  3. Optionally, LayerNorm is applied after the activation.
  4. Dropout is applied for regularization.
  5. Finally, a linear transformation maps the inner dimension to the output dimension.
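
The five steps above can be sketched as a small PyTorch module. This is a simplified stand-in, not the actual FeedForward source: the class name SketchFeedForward is hypothetical, the hidden width int(dim * mult) is an assumption, and GLU, bias control, and zero initialization are left out to keep the sketch short.

```python
import torch
import torch.nn as nn


class SketchFeedForward(nn.Module):
    """Hypothetical stand-in that mirrors the five forward-pass steps."""

    def __init__(self, dim, dim_out=None, mult=4, post_act_ln=False, dropout=0.0):
        super().__init__()
        inner_dim = int(dim * mult)                     # assumed hidden width
        dim_out = dim if dim_out is None else dim_out
        self.project_in = nn.Linear(dim, inner_dim)     # step 1: project to inner dim
        self.act = nn.GELU()                            # step 2: activation
        # step 3: optional LayerNorm after the activation
        self.norm = nn.LayerNorm(inner_dim) if post_act_ln else nn.Identity()
        self.drop = nn.Dropout(dropout)                 # step 4: dropout
        self.project_out = nn.Linear(inner_dim, dim_out)  # step 5: map to output dim

    def forward(self, x):
        x = self.project_in(x)
        x = self.act(x)
        x = self.norm(x)
        x = self.drop(x)
        return self.project_out(x)


model = SketchFeedForward(256, 512, post_act_ln=True, dropout=0.2)
out = model(torch.randn(1, 256))   # shape (1, 512)
```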

The FeedForward module offers flexibility in choosing activation functions, enabling you to experiment with different configurations in transformer-based models.

Tips and Considerations

  • Experiment with different activation functions to find the best configuration for your model.
  • Adjust the dropout rate to control overfitting.
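  • Consider enabling post_act_ln to normalize the activations before the output projection, which may help stabilize training in deep networks.
  • The zero_init_output option starts the output linear layer at zero, which can be useful when the module sits on a residual branch that should contribute nothing at initialization.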
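
To illustrate the last tip, here is a minimal sketch of what zero-initializing an output layer does on a residual branch. It uses a standalone nn.Linear rather than FeedForward, and it assumes that both the weight and the bias are zeroed; whether the module also zeroes the bias is not stated above.

```python
import torch
import torch.nn as nn

out_proj = nn.Linear(32, 16)
nn.init.zeros_(out_proj.weight)   # zero the weights
nn.init.zeros_(out_proj.bias)     # zero the bias as well (assumed)

hidden = torch.randn(4, 32)       # stand-in for the hidden activations
residual_in = torch.randn(4, 16)  # stand-in for the residual stream

# With a zeroed output layer, the branch adds nothing at initialization.
y = residual_in + out_proj(hidden)
```

The zeroed layer still receives non-zero gradients for its own weights, so the branch can start contributing as training proceeds.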