HoloGate-Flow: A Multi-Activation Gated Feed-Forward Layer with Affine Flow Coupling

Author: Jacques Gariépy, [email protected]
Date: January 2025


Abstract

Modern neural architectures, especially Transformers, rely on feed-forward networks (FFNs) to process representations after self-attention. Recent innovations such as Gated Linear Units (GLU, GeGLU, SwiGLU) have demonstrated that introducing gating mechanisms can substantially boost performance on tasks ranging from language modeling to computer vision. Concurrently, normalizing flows have provided a powerful framework for invertible transformations (scale and shift) used in generative modeling.

In this thesis-like document, we propose HoloGate-Flow, a new FFN design that:

  1. Splits the input into multiple sub-spaces, each undergoing different transformations and activations.
  2. Introduces a gating mechanism that modulates one sub-space with another.
  3. Incorporates an affine flow-inspired step in the skip-connection to preserve critical information, even when the gate closes.

The result is a more expressive and robust feed-forward layer that mitigates the risk of losing input signals, potentially improving gradient flow and overall model capacity. We provide comprehensive details on its formulation, implementation, theoretical considerations, and avenues for practical deployment.


1. Introduction

1.1. Background and Motivation

The Transformer architecture has revolutionized natural language processing and has increasingly been adapted to other domains. While the self-attention mechanism is widely recognized for its effectiveness, the feed-forward component also significantly contributes to the model’s expressivity and capacity. In standard Transformers, this FFN is often a two-layer MLP with a single nonlinear activation. Yet, as the number of parameters grows, even small design changes in the FFN can yield notable performance benefits.

1.2. Scope of this Document

This document aims to:

  • Present the conceptual foundation of HoloGate-Flow.
  • Compare it with well-known gating approaches (e.g., GLU, GeGLU, SwiGLU).
  • Demonstrate how affine transformations from normalizing flows can enhance skip-connections.
  • Offer guidance on implementing and experimenting with HoloGate-Flow in various deep learning tasks.

2. Related Work

2.1. Feed-Forward Networks in Transformers

The original Transformer [Vaswani et al., 2017] introduced a simple feed-forward sub-layer,

$$\text{FFN}(\mathbf{x}) = \max(0,\ \mathbf{x}W_1 + b_1)\,W_2 + b_2,$$

inserted after multi-head attention, with a skip-connection $\mathbf{x}_{\text{out}} = \mathbf{x} + \text{FFN}(\mathbf{x})$. Variants of this FFN have been proposed to improve representation learning and parameter efficiency.
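
For reference, a minimal PyTorch sketch of this standard sub-layer; the TransformerFFN class name and the d_ff width are illustrative, and d_ff = 4 * d_model is merely a common convention:

import torch
import torch.nn as nn

class TransformerFFN(nn.Module):
    """Two-layer position-wise FFN from the original Transformer, with the residual applied."""
    def __init__(self, d_model, d_ff):
        super().__init__()
        self.fc1 = nn.Linear(d_model, d_ff)   # x W1 + b1
        self.fc2 = nn.Linear(d_ff, d_model)   # (...) W2 + b2

    def forward(self, x):
        h = torch.relu(self.fc1(x))           # max(0, x W1 + b1)
        return x + self.fc2(h)                # x_out = x + FFN(x)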

2.2. Gating Mechanisms (GLU, GeGLU, SwiGLU)

Gated Linear Units (GLU) [Dauphin et al., 2017] showed that splitting the hidden dimension into two parts and applying a sigmoid gate to one portion can improve performance. GeGLU and SwiGLU [Shazeer, 2020] refine this by replacing the sigmoid with GELU or Swish/SiLU, further boosting performance and stability.
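
As a point of comparison, a compact sketch of this family in PyTorch, roughly following the formulation in Shazeer (2020); the GatedFFN class is illustrative, not a library API:

import torch
import torch.nn as nn
import torch.nn.functional as F

class GatedFFN(nn.Module):
    """GLU-family FFN: act(x W) elementwise-gates (x V), then projects back."""
    def __init__(self, d_model, d_ff, act=torch.sigmoid):
        super().__init__()
        self.W = nn.Linear(d_model, d_ff)      # activated branch
        self.V = nn.Linear(d_model, d_ff)      # linear branch
        self.W_out = nn.Linear(d_ff, d_model)
        self.act = act                         # torch.sigmoid -> GLU, F.gelu -> GeGLU, F.silu -> SwiGLU

    def forward(self, x):
        return self.W_out(self.act(self.W(x)) * self.V(x))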

2.3. Normalizing Flows (RealNVP, Glow, etc.)

Normalizing flows like RealNVP [Dinh et al., 2017] and Glow [Kingma & Dhariwal, 2018] introduced affine coupling layers for invertible transformations. Although designed for generative modeling, the concept of “scale and shift conditioned on a partition of the input” can inspire new skip-connection designs in discriminative models as well.
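
For intuition, a minimal sketch of an affine coupling layer in this spirit, simplified to omit masking schemes and log-determinant bookkeeping; the even 50/50 split is an assumption of the sketch:

import torch
import torch.nn as nn

class AffineCoupling(nn.Module):
    """RealNVP-style coupling: y1 = x1, y2 = x2 * exp(s(x1)) + t(x1)."""
    def __init__(self, d):
        super().__init__()
        assert d % 2 == 0, "this sketch assumes an even feature dimension"
        half = d // 2
        self.scale_net = nn.Sequential(nn.Linear(half, half), nn.Tanh())  # bounded log-scale
        self.shift_net = nn.Linear(half, half)

    def forward(self, x):
        x1, x2 = x[..., :x.size(-1) // 2], x[..., x.size(-1) // 2:]
        s = self.scale_net(x1)                  # scale conditioned only on x1
        t = self.shift_net(x1)                  # shift conditioned only on x1
        y2 = x2 * torch.exp(s) + t              # invertible given x1
        return torch.cat([x1, y2], dim=-1)      # y1 = x1 passes through unchanged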

2.4. Other FFN Enhancements

Aside from gating, researchers have also tested multi-branch FFNs, depthwise convolutions, and dynamic routing. However, few attempts combine gating with flow-like affine transforms in the residual path, which is precisely the focus of this work.


3. Proposed Approach: HoloGate-Flow

3.1. Overall Architecture

A typical Transformer block is:

x_in
  └─> [Multi-Head Attention + Residual] ─> x_mha
         └─> [FFN + Residual] ─> x_out

HoloGate-Flow replaces the standard FFN with a more elaborate structure, maintaining the residual connection but augmenting it with multi-activation gating and a flow-based affine step.

3.2. The Multi-Activation Gating Module

Instead of having a single feed-forward path, HoloGate-Flow splits the input vector $\mathbf{x}$ into three parts $\mathbf{x}_1, \mathbf{x}_2, \mathbf{x}_3$. Each part is transformed by a distinct linear layer (or MLP) and activation function. One of these (typically $\mathbf{x}_3$) becomes a “meta-gate” applied to another part (e.g., $\mathbf{x}_2$) via a Sigmoid, thus controlling the flow of information.

3.3. Affine Flow Coupling in the Skip-Connection

Following the gating, we concatenate the transformed vectors $\mathbf{z}_1$ and $\mathbf{z}_{\text{gated}}$, and then project back to the original dimension. However, to preserve information if the gating is “too closed,” HoloGate-Flow introduces an affine transformation (scale + shift) in the residual. The scale and shift are themselves predicted by small feed-forward networks, reminiscent of affine coupling in normalizing flows.

3.4. Formal Description

Let $\mathbf{x} \in \mathbb{R}^d$, with three sub-dimensions $d_1 + d_2 + d_3 = d$. Split:

x1 = x[:d1]
x2 = x[d1 : d1+d2]
x3 = x[d1+d2:]

Apply linear transformations and activations:

z1 = activation1( W1(x1) )
z2 = activation2( W2(x2) )
z3 = activation3( W3(x3) )

Compute gating:

gate = sigmoid( z3 )
z_gated = gate * z2

Concatenate and project:

z_cat   = concat(z1, z_gated)
z_final = W_out( z_cat )

Flow-like residual:

scale = sigmoid( W_scale( LN(z_cat) ) )
shift = W_shift( LN(z_cat) )
out   = x + scale * z_final + shift

where LN denotes LayerNorm, ensuring stable ranges for scale and shift.
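
As a sanity check on the dimensions above, here is a minimal functional trace of these equations; the toy sizes (d1 = d2 = d3 = 4, branch width h = 8) are arbitrary, and Section 4.4 uses d_model as the branch width instead:

import torch
import torch.nn as nn
import torch.nn.functional as F

d1, d2, d3, h = 4, 4, 4, 8
d = d1 + d2 + d3
x = torch.randn(2, d)                               # batch of 2

W1, W2, W3 = nn.Linear(d1, h), nn.Linear(d2, h), nn.Linear(d3, h)
W_out = nn.Linear(2 * h, d)
W_scale, W_shift = nn.Linear(2 * h, d), nn.Linear(2 * h, d)
LN = nn.LayerNorm(2 * h)

x1, x2, x3 = x[:, :d1], x[:, d1:d1 + d2], x[:, d1 + d2:]
z1, z2, z3 = F.gelu(W1(x1)), F.silu(W2(x2)), W3(x3)
z_gated = torch.sigmoid(z3) * z2                    # gate z2 with z3
z_cat = torch.cat([z1, z_gated], dim=-1)            # shape (2, 2*h)
z_final = W_out(z_cat)                              # back to (2, d)

scale = torch.sigmoid(W_scale(LN(z_cat)))
shift = W_shift(LN(z_cat))
out = x + scale * z_final + shift
print(out.shape)                                    # torch.Size([2, 12])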

3.5. Advantages over Existing Methods

  1. Enhanced Flexibility: Multiple activations allow diverse nonlinear behaviors.
  2. Contextual Gating: Sub-space $\mathbf{x}_3$ can learn to gate $\mathbf{x}_2$ selectively.
  3. Robust Residual: Even if gating is near zero, the affine flow can re-inject or shift the original signal, mitigating vanishing pathways.
  4. Ease of Integration: Drop-in replacement for standard Transformer FFNs, albeit with increased parameter count.

4. Implementation Details

4.1. Dimensional Splitting

  • Equal Splits: (d1, d2, d3) = (d/3, d/3, d/3) is the simplest approach.
  • Unequal Splits: For some tasks, a bigger sub-space for gating might be beneficial (e.g., (d/4, d/2, d/4)).
  • Empirical Tuning: The ratio is an additional hyperparameter; when d_model is not evenly divisible, the splits must be rounded so they still sum to d_model (a small helper is sketched below).
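
A small helper for deriving such splits, sketched under the assumption that fractional ratios should be rounded and the remainder absorbed by the last split; the make_splits name is illustrative:

def make_splits(d_model, ratios=(1/3, 1/3, 1/3)):
    """Turn fractional ratios into integer (d1, d2, d3) that sum exactly to d_model."""
    sizes = [int(d_model * r) for r in ratios]
    sizes[-1] = d_model - sum(sizes[:-1])   # absorb rounding error in the last split
    return tuple(sizes)

# Examples: make_splits(512) -> (170, 170, 172); make_splits(512, (0.25, 0.5, 0.25)) -> (128, 256, 128)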

4.2. Activation Functions

  • Common Choices: GELU, SiLU/Swish, ReLU, Tanh, Sigmoid.
  • Gating Branch: Typically uses Sigmoid as a final function for the gating signal. If z3 is pre-activated, activation3 could be an identity or a mild nonlinearity.

4.3. Layer Normalization and Stability

Applying LayerNorm to z_cat (the concatenation of the transformed sub-vectors) before computing scale and shift helps stabilize training. It prevents large magnitudes in z_cat from causing extreme scale or shift values.

4.4. Full Pseudocode (PyTorch)

import torch
import torch.nn as nn
import torch.nn.functional as F

class HoloGateFlow(nn.Module):
    def __init__(self, d_model, d1, d2, d3,
                 activation1='gelu',
                 activation2='silu',
                 activation3='none'):
        super().__init__()
        # The three split sizes must tile the model dimension exactly.
        assert d1 + d2 + d3 == d_model, "d1 + d2 + d3 must equal d_model"
        self.d1 = d1
        self.d2 = d2
        self.d3 = d3

        # Linear transformations
        self.W1 = nn.Linear(d1, d_model)
        self.W2 = nn.Linear(d2, d_model)
        self.W3 = nn.Linear(d3, d_model)

        # Final projection after concatenation
        self.W_out = nn.Linear(2 * d_model, d_model)

        # Flow-based scale & shift
        self.flow_scale = nn.Linear(2 * d_model, d_model)
        self.flow_shift = nn.Linear(2 * d_model, d_model)

        # Layer normalization
        self.norm = nn.LayerNorm(2 * d_model)

        # Activations
        self.activation1 = self._get_activation(activation1)
        self.activation2 = self._get_activation(activation2)
        self.activation3 = self._get_activation(activation3)

    def _get_activation(self, name):
        name = name.lower()
        if name == 'relu':
            return F.relu
        elif name == 'gelu':
            return F.gelu
        elif name in ('silu', 'swish'):
            return F.silu
        elif name == 'tanh':
            return torch.tanh
        elif name == 'none':
            return lambda x: x
        else:
            raise ValueError(f"Unknown activation: {name}")

    def forward(self, x):
        """
        x: (batch_size, d_model)
        """
        # 1) Split
        x1 = x[:, :self.d1]
        x2 = x[:, self.d1:self.d1 + self.d2]
        x3 = x[:, self.d1 + self.d2:]

        # 2) Transform
        z1 = self.activation1(self.W1(x1))
        z2 = self.activation2(self.W2(x2))
        z3 = self.activation3(self.W3(x3))

        # 3) Gating
        gate = torch.sigmoid(z3)
        z_gated = gate * z2

        # 4) Concat & project
        z_cat = torch.cat([z1, z_gated], dim=-1)
        z_final = self.W_out(z_cat)

        # 5) Flow-affine skip
        z_cat_norm = self.norm(z_cat)
        scale = torch.sigmoid(self.flow_scale(z_cat_norm))  # [0,1]
        shift = self.flow_shift(z_cat_norm)

        out = x + scale * z_final + shift
        return out
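
A minimal usage sketch, assuming the class above and a split that tiles d_model exactly:

# Hypothetical smoke test; d_model = 512 and this particular split are arbitrary choices.
layer = HoloGateFlow(d_model=512, d1=170, d2=170, d3=172)
x = torch.randn(8, 512)       # (batch_size, d_model)
out = layer(x)
print(out.shape)              # torch.Size([8, 512])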

4.5. Parameter Count Considerations

Compared to a standard FFN, HoloGate-Flow introduces:

  • Extra linear layers (W3, plus the flow-based flow_scale & flow_shift).
  • Similar or slightly larger hidden dimensions (depending on how one sets up expansions).
  • Overall more parameters, which may be beneficial for large-scale tasks but must be justified on smaller problems or in resource-constrained settings (a quick way to quantify the difference is sketched below).
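
One quick way to quantify the overhead, assuming the HoloGateFlow class from Section 4.4 and a standard two-layer FFN with a 4x expansion as the baseline (both choices are illustrative):

def count_params(module):
    return sum(p.numel() for p in module.parameters())

d_model = 512
baseline = nn.Sequential(                      # standard FFN with a 4x expansion, for comparison
    nn.Linear(d_model, 4 * d_model), nn.GELU(), nn.Linear(4 * d_model, d_model))
hologate = HoloGateFlow(d_model, d1=170, d2=170, d3=172)

print("standard FFN params:  ", count_params(baseline))
print("HoloGate-Flow params: ", count_params(hologate))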

5. Theoretical Considerations

5.1. Expressivity and Nonlinearity

By applying multiple distinct activations (e.g., GeLU, SiLU, Sigmoid) to different sub-spaces, the network can approximate a broader class of functions. This synergy of heterogeneous nonlinearities can, in principle, learn more nuanced mappings.

5.2. Gradient Flow and Gating Behavior

Gating mechanisms can cause “closed” pathways if the Sigmoid output saturates at zero. The skip-connection plus affine flow ensures that the original input can still pass through the residual. This helps maintain stable gradients and avoid dead branches.

5.3. Connections to Normalizing Flows

Although the transformation is not strictly invertible, it borrows heavily from affine coupling ideas:

  • Partition input (here, conceptually z1 vs. z2/gate).
  • Compute scale and shift from some portion of the representation.
  • Apply them to the other part in a residual path.

5.4. Potential Limitations

  • Non-Invertibility: We lose the guarantee of invertibility from normalizing flows.
  • Increased Complexity: More parameters and hyperparameters might lead to longer tuning cycles.
  • Possibility of Overfitting: The richer capacity demands careful regularization in smaller datasets.

6. Experiments and Evaluation Protocols

Below are recommended experimental setups to validate HoloGate-Flow:

6.1. Language Modeling Experiments

  1. Baseline: A standard Transformer or GPT-like model with a 2-layer MLP FFN.
  2. HoloGate-Flow: Replace each FFN with the proposed block.
  3. Compare perplexity (PPL) on datasets like WikiText-103 or The Pile.
  4. Vary (d1, d2, d3) splits, activation combos (GELU, SiLU, ReLU), and see if perplexity improves.

6.2. Vision Transformer Use Cases

  1. ViT: Insert HoloGate-Flow in place of the 2-layer MLP applied to the patch tokens.
  2. Datasets: Evaluate on CIFAR-10, ImageNet, or smaller tasks.
  3. Metrics: Track top-1/top-5 accuracy, training curves, parameter usage.

6.3. Ablation Studies

  • Remove Flow: out = x + z_final (no scale/shift); evaluate the difference (a variant with a use_flow switch is sketched after this list).
  • Single Activation: Use the same activation for z1, z2, z3 to see whether multiple activations truly help.
  • Different Normalization: RMSNorm vs. LayerNorm.
  • Varying Gate: Try Tanh or ReLU in place of the Sigmoid gate and check whether performance changes.
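
For the “Remove Flow” ablation, one lightweight option is a subclass with a switch; the use_flow flag below is a hypothetical addition, not part of the reference code in Section 4.4:

class HoloGateFlowAblation(HoloGateFlow):
    """Same block, but the flow-affine residual step can be switched off."""
    def __init__(self, *args, use_flow=True, **kwargs):
        super().__init__(*args, **kwargs)
        self.use_flow = use_flow

    def forward(self, x):
        x1 = x[..., :self.d1]
        x2 = x[..., self.d1:self.d1 + self.d2]
        x3 = x[..., self.d1 + self.d2:]
        z1 = self.activation1(self.W1(x1))
        z2 = self.activation2(self.W2(x2))
        gate = torch.sigmoid(self.activation3(self.W3(x3)))
        z_cat = torch.cat([z1, gate * z2], dim=-1)
        z_final = self.W_out(z_cat)
        if not self.use_flow:
            return x + z_final                      # plain residual, no scale/shift
        z_cat_norm = self.norm(z_cat)
        scale = torch.sigmoid(self.flow_scale(z_cat_norm))
        return x + scale * z_final + self.flow_shift(z_cat_norm)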

6.4. Metrics and Baselines

  • Quality: Perplexity (LM), accuracy (classification), F1 (NLP tasks), etc.
  • Efficiency: FLOPs, memory usage, speed.
  • Stability: Gradient norms, training loss curves, sensitivity to hyperparameters.

7. Practical Guidelines

7.1. Hyperparameter Tuning

  • Splitting Ratios: (d1, d2, d3) can follow (d/3, d/3, d/3) initially, but experiment with variations.
  • Activation Choices: Default to (GELU, SiLU, Sigmoid) or (ReLU, SiLU, Sigmoid). Evaluate performance differences.
  • Flow Scale/Shift: Typically keep the scale in [0, 1] via Sigmoid. If negative scaling is desired, one might shift the output or use a different bounding strategy (one alternative is sketched below).
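
If negative or mildly amplifying scales are desired, two candidate bounding functions are sketched below; these are assumptions about what might work, not part of the reference implementation:

import torch

def signed_scale(raw):
    """Alternative to sigmoid: a signed scale in [-1, 1]."""
    return torch.tanh(raw)

def stretched_scale(raw):
    """Alternative to sigmoid: a scale in [0, 2] that can mildly amplify z_final."""
    return 2.0 * torch.sigmoid(raw)

# Inside HoloGateFlow.forward one would then use, e.g.:
#   scale = signed_scale(self.flow_scale(z_cat_norm))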

7.2. Initialization

  • Linear Layers: Use standard initializers (Xavier, Kaiming).
  • Bias: Initialize the bias of flow_scale to zero so that the initial scale sits near 0.5 (sigmoid(0) = 0.5), or slightly negative for a smaller initial contribution; either choice avoids extreme scales at startup (an init helper is sketched below).
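
A sketch of one way to apply these choices, assuming the HoloGateFlow module from Section 4.4; the init_hologate helper is illustrative:

def init_hologate(module):
    """Xavier-init the linear layers; a zero bias leaves the sigmoid scale near 0.5 at startup."""
    for m in module.modules():
        if isinstance(m, nn.Linear):
            nn.init.xavier_uniform_(m.weight)
            nn.init.zeros_(m.bias)
    # Optional: bias the scale toward a smaller initial contribution.
    # nn.init.constant_(module.flow_scale.bias, -1.0)   # sigmoid(-1) ~ 0.27
    return module

layer = init_hologate(HoloGateFlow(d_model=512, d1=170, d2=170, d3=172))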

7.3. Regularization and Optimization

  • Weight Decay: Standard in large models to reduce overfitting.
  • Gradient Clipping: May be necessary, since the shift term is unbounded and can produce large values even though the sigmoid keeps the scale within [0, 1].
  • Learning Rate: Adjust depending on overall model size. With more parameters, sometimes a smaller LR is helpful.

7.4. Deployment Considerations

  • Compatibility: HoloGate-Flow can replace standard FFNs in any Transformer-based system with minimal changes to the code structure.
  • Inference Overhead: The additional matrix multiplications (W3, flow_scale, flow_shift) add latency. Evaluate the trade-off carefully for real-time applications (a simple timing sketch follows).
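
A rough latency comparison, sketched under the assumption that a two-layer FFN with a 4x expansion is the baseline and that CPU wall-clock timing suffices for a first estimate:

import time

def time_forward(module, x, iters=100):
    module.eval()
    with torch.no_grad():
        start = time.perf_counter()
        for _ in range(iters):
            module(x)
        return (time.perf_counter() - start) / iters

x = torch.randn(256, 512)                      # a flat batch of token vectors
baseline = nn.Sequential(nn.Linear(512, 2048), nn.GELU(), nn.Linear(2048, 512))
hologate = HoloGateFlow(d_model=512, d1=170, d2=170, d3=172)

print("standard FFN  s/iter:", time_forward(baseline, x))
print("HoloGate-Flow s/iter:", time_forward(hologate, x))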

8. Discussion and Future Directions

8.1. Incorporating Other Flow Components

RealNVP and Glow rely on additional machinery such as invertible 1x1 convolutions (in Glow) and log-determinant computations. While not strictly necessary in HoloGate-Flow, exploring invertible transformations might yield interesting “reversible” layers for Transformers.

8.2. Multi-Layer HoloGate-Flow Blocks

One could stack multiple gating layers within a single FFN block, creating a deeper gating hierarchy. However, this may increase complexity and training time substantially.

8.3. Invertibility and Beyond

While HoloGate-Flow is not guaranteed invertible, future research might explore fully invertible variants that preserve the normalizing flow property, possibly allowing more advanced generative or probabilistic interpretations.

8.4. Potential for Multi-Modal Inputs

If x1, x2, x3 represent different modalities (e.g., text, audio, vision), gating can selectively fuse them. The affine flow in the residual might help unify multi-modal embeddings in a flexible manner.


9. Conclusion

HoloGate-Flow introduces a novel way to enhance the Transformer feed-forward layer by combining multi-activation gating and a flow-based affine residual step. This design allows for richer non-linear interactions, context-aware gating, and preservation of the input signal even when gating is minimal.

Though the approach increases the parameter budget and computational cost, preliminary considerations suggest potential benefits in terms of expressivity, stability, and gradient flow. We anticipate that thorough experimentation and ablation will confirm (or challenge) these hypothesized advantages. Future work may explore deeper gating hierarchies, invertible transformations, and specialized domain applications (e.g., multi-modal tasks).


References

  1. Vaswani et al. (2017)
    Attention Is All You Need. NIPS.
    https://arxiv.org/abs/1706.03762

  2. Dauphin et al. (2017)
    Language Modeling with Gated Convolutional Networks. ICML.
    https://arxiv.org/abs/1612.08083

  3. Shazeer, Noam (2020)
    GLU Variants Improve Transformer.
    https://arxiv.org/abs/2002.05202

  4. Dinh et al. (2017)
    Density Estimation using Real NVP. ICLR.
    https://arxiv.org/abs/1605.08803

  5. Kingma & Dhariwal (2018)
    Glow: Generative Flow with Invertible 1x1 Convolutions. NeurIPS.
    https://arxiv.org/abs/1807.03039

  6. Ramachandran et al. (2017)
    Searching for Activation Functions.
    https://arxiv.org/abs/1710.05941

  7. PyTorch Documentation
    https://pytorch.org/docs


Appendix A: Extended Pseudocode / Alternatives

Below is an alternative “lightweight” version if you wish to remove extra complexity:

class HoloGateFlowLite(nn.Module):
    def __init__(self, d_model):
        super().__init__()
        self.fc1 = nn.Linear(d_model, d_model * 3)  # Single linear to produce x1, x2, x3
        self.fc2 = nn.Linear(d_model * 2, d_model)
        self.norm = nn.LayerNorm(d_model * 2)
        self.flow_scale = nn.Linear(d_model * 2, d_model)
        self.flow_shift = nn.Linear(d_model * 2, d_model)

    def forward(self, x):
        combined = self.fc1(x)  # shape: (batch, d_model*3)
        d = combined.size(-1) // 3
        z1, z2, z3 = torch.split(combined, d, dim=-1)

        # Activations
        z1 = F.gelu(z1)
        z2 = F.silu(z2)
        gate = torch.sigmoid(z3)
        z_gated = gate * z2

        z_cat = torch.cat([z1, z_gated], dim=-1)
        z_final = self.fc2(z_cat)

        z_cat_norm = self.norm(z_cat)
        scale = torch.sigmoid(self.flow_scale(z_cat_norm))
        shift = self.flow_shift(z_cat_norm)

        return x + scale * z_final + shift
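
A minimal smoke test for the lite variant, assuming the imports from Section 4.4; the sizes are arbitrary:

lite = HoloGateFlowLite(d_model=256)
x = torch.randn(4, 256)
print(lite(x).shape)          # torch.Size([4, 256])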

Appendix B: Additional Figures and Diagrams

  1. Block Diagram:

    x (input) ---------------------------------------(residual)----+
        |                                                          |
        |  split into x1, x2, x3                                   |
        v                                                          |
    +--------+--------+--------+                                   |
    |   x1   |   x2   |   x3   |                                   |
    +--------+--------+--------+                                   |
        |        |        |                                        |
        W1       W2       W3                                       |
        |        |        |                                        |
       Act1     Act2     Act3                                      |
        |        |        |                                        |
        z1       z2    sigmoid(z3) = gate                          |
        |         \        /                                       |
        |      z_gated = gate * z2                                 |
        |             |                                            |
        +------+------+                                            |
               |                                                   |
      z_cat = concat(z1, z_gated)                                  |
           |                  |                                    |
         W_out            LN(z_cat)                                |
           |             |            |                            |
        z_final   scale = sigmoid(W_scale)   shift = W_shift       |
           |             |                       |                 |
           +-------------+-----------------------+-----------------+
                                  |
                out = x + scale * z_final + shift
    
  2. Flow Affine Concept:

    scale = Sigmoid( W_scale( LN(...) ) )
    shift = W_shift( LN(...) )
    out   = x + scale * z_final + shift
    

These diagrams illustrate the major components of HoloGate-Flow and how the gating plus flow-affine skip-connection work in tandem.
