Author: Jacques Gariépy, [email protected]
Date: January 2025
Modern neural architectures, especially Transformers, rely on feed-forward networks (FFNs) to process representations after self-attention. Recent innovations such as Gated Linear Units (GLU, GeGLU, SwiGLU) have demonstrated that introducing gating mechanisms can substantially boost performance on tasks ranging from language modeling to computer vision. Concurrently, normalizing flows have provided a powerful framework for invertible transformations built from affine (scale-and-shift) coupling layers, used primarily in generative modeling.
In this thesis-like document, we propose HoloGate-Flow, a new FFN design that:
- Splits the input into multiple sub-spaces, each undergoing different transformations and activations.
- Introduces a gating mechanism that modulates one sub-space with another.
- Incorporates an affine flow-inspired step in the skip-connection to preserve critical information, even when the gate closes.
The result is a more expressive and robust feed-forward layer that mitigates the risk of losing input signals, potentially improving gradient flow and overall model capacity. We provide comprehensive details on its formulation, implementation, theoretical considerations, and avenues for practical deployment.
The Transformer architecture has revolutionized natural language processing and has increasingly been adapted to other domains. While the self-attention mechanism is widely recognized for its effectiveness, the feed-forward component also significantly contributes to the model’s expressivity and capacity. In standard Transformers, this FFN is often a two-layer MLP with a single nonlinear activation. Yet, as the number of parameters grows, even small design changes in the FFN can yield notable performance benefits.
This document aims to:
- Present the conceptual foundation of HoloGate-Flow.
- Compare it with well-known gating approaches (e.g., GLU, GeGLU, SwiGLU).
- Demonstrate how affine transformations from normalizing flows can enhance skip-connections.
- Offer guidance on implementing and experimenting with HoloGate-Flow in various deep learning tasks.
The original Transformer [Vaswani et al., 2017] introduced a simple feed-forward sub-layer:

$$\text{FFN}(\mathbf{x}) = \max(0,\ \mathbf{x}W_1 + b_1)\, W_2 + b_2,$$

inserted after multi-head attention, with a skip-connection $\mathbf{x}_{\text{out}} = \mathbf{x} + \text{FFN}(\mathbf{x})$. Variants of this FFN have been proposed to improve representation learning and parameter efficiency.
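For concreteness, here is a minimal PyTorch sketch of this baseline FFN (the surrounding block adds the residual; `d_ff` is the usual hidden expansion, commonly 4 × `d_model`):

```python
import torch
import torch.nn as nn


class BaselineFFN(nn.Module):
    """Standard two-layer Transformer FFN: max(0, x W1 + b1) W2 + b2."""
    def __init__(self, d_model, d_ff):
        super().__init__()
        self.fc1 = nn.Linear(d_model, d_ff)
        self.fc2 = nn.Linear(d_ff, d_model)

    def forward(self, x):
        return self.fc2(torch.relu(self.fc1(x)))
```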
Gated Linear Units (GLU) [Dauphin et al., 2017] showed that splitting the hidden dimension into two parts and applying a sigmoid gate to one portion can improve performance. GeGLU and SwiGLU [Shazeer, 2020] refined this by replacing the sigmoid gate with GELU or Swish/SiLU, further boosting performance and stability.
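As a point of comparison, a minimal sketch of the SwiGLU variant (bias-free linear layers, following Shazeer, 2020; the dimensions are illustrative):

```python
import torch.nn as nn
import torch.nn.functional as F


class SwiGLU(nn.Module):
    """FFN_SwiGLU(x) = (SiLU(x W) * (x V)) W2 — the SiLU branch gates the linear branch."""
    def __init__(self, d_model, d_ff):
        super().__init__()
        self.W = nn.Linear(d_model, d_ff, bias=False)
        self.V = nn.Linear(d_model, d_ff, bias=False)
        self.W2 = nn.Linear(d_ff, d_model, bias=False)

    def forward(self, x):
        return self.W2(F.silu(self.W(x)) * self.V(x))
```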
Normalizing flows like RealNVP [Dinh et al., 2017] and Glow [Kingma & Dhariwal, 2018] introduced affine coupling layers for invertible transformations. Although designed for generative modeling, the concept of “scale and shift conditioned on a partition of the input” can inspire new skip-connection designs in discriminative models as well.
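To make the analogy concrete, here is a minimal sketch of a RealNVP-style affine coupling layer; the hidden width and Tanh conditioner are illustrative choices:

```python
import torch
import torch.nn as nn


class AffineCoupling(nn.Module):
    """Affine coupling: y1 = x1, y2 = x2 * exp(s(x1)) + t(x1), invertible by construction."""
    def __init__(self, d_half, d_hidden=64):
        super().__init__()
        self.s = nn.Sequential(nn.Linear(d_half, d_hidden), nn.Tanh(), nn.Linear(d_hidden, d_half))
        self.t = nn.Sequential(nn.Linear(d_half, d_hidden), nn.Tanh(), nn.Linear(d_hidden, d_half))

    def forward(self, x):
        # Condition the scale and shift on one partition of the input
        x1, x2 = x.chunk(2, dim=-1)
        y2 = x2 * torch.exp(self.s(x1)) + self.t(x1)
        return torch.cat([x1, y2], dim=-1)

    def inverse(self, y):
        y1, y2 = y.chunk(2, dim=-1)
        x2 = (y2 - self.t(y1)) * torch.exp(-self.s(y1))
        return torch.cat([y1, x2], dim=-1)
```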
Aside from gating, researchers have also tested multi-branch FFNs, depthwise convolutions, and dynamic routing. However, few attempts combine gating with flow-like affine transforms in the residual path, which is precisely the focus of this work.
A typical Transformer block is:
```
x_in
 └─> [Multi-Head Attention + Residual] ─> x_mha
       └─> [FFN + Residual] ─> x_out
```
HoloGate-Flow replaces the standard FFN with a more elaborate structure, maintaining the residual connection but augmenting it with multi-activation gating and a flow-based affine step.
Instead of having a single feed-forward path, HoloGate-Flow splits the input vector $\mathbf{x}$ into three parts $\mathbf{x}_1, \mathbf{x}_2, \mathbf{x}_3$. Each part is transformed by a distinct linear layer (or MLP) and activation function. One of these (typically $\mathbf{x}_3$) becomes a “meta-gate” applied to another part (e.g., $\mathbf{x}_2$) via a Sigmoid, thus controlling the flow of information.
Following the gating, we concatenate the transformed vectors $\mathbf{z}_1$ and $\mathbf{z}_{\text{gated}}$, and then project back to the original dimension. However, to preserve information if the gating is “too closed,” HoloGate-Flow introduces an affine transformation (scale + shift) in the residual. The scale and shift are themselves predicted by small feed-forward networks, reminiscent of affine coupling in normalizing flows.
Let $\mathbf{x} \in \mathbb{R}^d$, with three sub-dimensions $d_1 + d_2 + d_3 = d$. Split:

```
x1 = x[:d1]
x2 = x[d1 : d1+d2]
x3 = x[d1+d2:]
```

Apply linear transformations and activations:

```
z1 = activation1( W1(x1) )
z2 = activation2( W2(x2) )
z3 = activation3( W3(x3) )
```

Compute gating:

```
gate    = sigmoid( z3 )
z_gated = gate * z2
```

Concatenate and project:

```
z_cat   = concat(z1, z_gated)
z_final = W_out( z_cat )
```

Flow-like residual:

```
scale = sigmoid( W_scale( LN(z_cat) ) )
shift = W_shift( LN(z_cat) )
out   = x + scale * z_final + shift
```

where LN denotes LayerNorm, ensuring stable ranges for scale and shift.
- Enhanced Flexibility: Multiple activations allow diverse nonlinear behaviors.
- Contextual Gating: Sub-space $\mathbf{x}_3$ can learn to gate $\mathbf{x}_2$ selectively.
- Robust Residual: Even if gating is near zero, the affine flow can re-inject or shift the original signal, mitigating vanishing pathways.
- Ease of Integration: Drop-in replacement for standard Transformer FFNs, albeit with increased parameter count.
- Equal Splits: `(d1, d2, d3) = (d/3, d/3, d/3)` is the simplest approach.
- Unequal Splits: For some tasks, a larger sub-space for gating might be beneficial (e.g., `(d/4, d/2, d/4)`).
- Empirical Tuning: The split ratio is an additional hyperparameter to tune (a small helper is sketched after this list).
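A possible helper (hypothetical, not part of the module defined below) for turning a ratio into integer splits that sum exactly to `d_model`:

```python
def split_dims(d_model, ratios=(1, 1, 1)):
    """Convert a ratio tuple into integer (d1, d2, d3) summing exactly to d_model."""
    total = sum(ratios)
    d1 = d_model * ratios[0] // total
    d2 = d_model * ratios[1] // total
    d3 = d_model - d1 - d2  # absorb rounding error in the last split
    return d1, d2, d3


# split_dims(512)            -> (170, 170, 172)   # roughly (d/3, d/3, d/3)
# split_dims(512, (1, 2, 1)) -> (128, 256, 128)   # larger gated sub-space
```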
- Common Choices: GELU, SiLU/Swish, ReLU, Tanh, Sigmoid.
- Gating Branch: Typically uses Sigmoid as the final function for the gating signal. If `z3` is pre-activated, `activation3` can be an identity or a mild nonlinearity.
Applying LayerNorm to z_cat (the concatenation of the transformed sub-vectors) before computing scale and shift helps stabilize training. It prevents large magnitudes in z_cat from causing extreme scale or shift values.
```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class HoloGateFlow(nn.Module):
    def __init__(self, d_model, d1, d2, d3,
                 activation1='gelu',
                 activation2='silu',
                 activation3='none'):
        super().__init__()
        self.d1 = d1
        self.d2 = d2
        self.d3 = d3

        # Linear transformations
        self.W1 = nn.Linear(d1, d_model)
        self.W2 = nn.Linear(d2, d_model)
        self.W3 = nn.Linear(d3, d_model)

        # Final projection after concatenation
        self.W_out = nn.Linear(2 * d_model, d_model)

        # Flow-based scale & shift
        self.flow_scale = nn.Linear(2 * d_model, d_model)
        self.flow_shift = nn.Linear(2 * d_model, d_model)

        # Layer normalization
        self.norm = nn.LayerNorm(2 * d_model)

        # Activations
        self.activation1 = self._get_activation(activation1)
        self.activation2 = self._get_activation(activation2)
        self.activation3 = self._get_activation(activation3)

    def _get_activation(self, name):
        name = name.lower()
        if name == 'relu':
            return F.relu
        elif name == 'gelu':
            return F.gelu
        elif name in ('silu', 'swish'):
            return F.silu
        elif name == 'tanh':
            return torch.tanh
        elif name == 'none':
            return lambda x: x
        else:
            raise ValueError(f"Unknown activation: {name}")

    def forward(self, x):
        """
        x: (batch_size, d_model), with d1 + d2 + d3 == d_model
        """
        # 1) Split
        x1 = x[:, :self.d1]
        x2 = x[:, self.d1:self.d1 + self.d2]
        x3 = x[:, self.d1 + self.d2:]

        # 2) Transform
        z1 = self.activation1(self.W1(x1))
        z2 = self.activation2(self.W2(x2))
        z3 = self.activation3(self.W3(x3))

        # 3) Gating
        gate = torch.sigmoid(z3)
        z_gated = gate * z2

        # 4) Concat & project
        z_cat = torch.cat([z1, z_gated], dim=-1)
        z_final = self.W_out(z_cat)

        # 5) Flow-affine skip
        z_cat_norm = self.norm(z_cat)
        scale = torch.sigmoid(self.flow_scale(z_cat_norm))  # in [0, 1]
        shift = self.flow_shift(z_cat_norm)

        out = x + scale * z_final + shift
        return out
```
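A quick smoke test of the module above; the split sizes are illustrative and must sum to `d_model`:

```python
d_model = 512
ffn = HoloGateFlow(d_model, d1=170, d2=170, d3=172)  # d1 + d2 + d3 == d_model
x = torch.randn(8, d_model)                          # a batch of 8 token vectors
out = ffn(x)
print(out.shape)                                     # torch.Size([8, 512])
```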
Compared to a standard FFN, HoloGate-Flow introduces the following (a quick parameter-count check is sketched after this list):
- Extra linear layers (`W3`, plus the flow-based `flow_scale` and `flow_shift`).
- Similar or slightly larger hidden dimensions (depending on how one sets up the expansions).
- Overall more parameters, which may be beneficial for large-scale tasks but must be justified on smaller problems or resource-constrained settings.
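Since the exact budget depends on the split sizes and any expansion factor, a quick way to compare parameter counts for a given configuration (the baseline below uses the common 4× expansion; all sizes are illustrative):

```python
count = lambda m: sum(p.numel() for p in m.parameters())

baseline = nn.Sequential(nn.Linear(512, 2048), nn.ReLU(), nn.Linear(2048, 512))
hologate = HoloGateFlow(512, d1=170, d2=170, d3=172)

print(f"baseline FFN: {count(baseline):,} params")
print(f"HoloGateFlow: {count(hologate):,} params")
```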
By applying multiple distinct activations (e.g., GeLU, SiLU, Sigmoid) to different sub-spaces, the network can approximate a broader class of functions. This synergy of heterogeneous nonlinearities can, in principle, learn more nuanced mappings.
Gating mechanisms can cause “closed” pathways if the Sigmoid output saturates at zero. The skip-connection plus affine flow ensures that the original input can still pass through the residual. This helps maintain stable gradients and avoid dead branches.
Although the transformation is not strictly invertible, it borrows heavily from affine coupling ideas:
- Partition the input (here, conceptually `z1` vs. `z2`/gate).
- Compute scale and shift from some portion of the representation.
- Apply them to the other part in a residual path.
- Non-Invertibility: We lose the guarantee of invertibility from normalizing flows.
- Increased Complexity: More parameters and hyperparameters might lead to longer tuning cycles.
- Possibility of Overfitting: The richer capacity demands careful regularization in smaller datasets.
Below are recommended experimental setups to validate HoloGate-Flow:
- Baseline: A standard Transformer or GPT-like model with a 2-layer MLP FFN.
- HoloGate-Flow: Replace each FFN with the proposed block.
- Compare perplexity (PPL) on datasets like WikiText-103 or The Pile.
- Vary the `(d1, d2, d3)` splits and activation combinations (GELU, SiLU, ReLU) and check whether perplexity improves.
- ViT: Insert HoloGate-Flow in place of the 2-layer MLP on patches.
- Datasets: Evaluate on CIFAR-10, ImageNet, or smaller tasks.
- Metrics: Track top-1/top-5 accuracy, training curves, parameter usage.
- Remove Flow: `out = x + z_final` (no scale/shift). Evaluate the difference (a sketch of this variant follows the list).
- Single Activation: Use the same activation for `z1`, `z2`, `z3` to see whether multiple activations truly help.
- Different Normalization: RMSNorm vs. LayerNorm.
- Varying Gate: Try Tanh or ReLU as gating. Check if the gating’s effect changes performance.
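As an example of the first ablation, the flow-affine step can be dropped by subclassing the module above; this is a sketch, not a tuned configuration:

```python
class HoloGateFlowNoFlow(HoloGateFlow):
    """Ablation: keep the split, multi-activation, and gating, but use a plain residual."""
    def forward(self, x):
        x1 = x[:, :self.d1]
        x2 = x[:, self.d1:self.d1 + self.d2]
        x3 = x[:, self.d1 + self.d2:]

        z1 = self.activation1(self.W1(x1))
        z2 = self.activation2(self.W2(x2))
        gate = torch.sigmoid(self.activation3(self.W3(x3)))

        z_final = self.W_out(torch.cat([z1, gate * z2], dim=-1))
        return x + z_final  # out = x + z_final, no scale/shift
```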
- Quality: Perplexity (LM), accuracy (classification), F1 (NLP tasks), etc.
- Efficiency: FLOPs, memory usage, speed.
- Stability: Gradient norms, training loss curves, sensitivity to hyperparameters.
- Splitting Ratios: `(d1, d2, d3)` can follow `(d/3, d/3, d/3)` initially, but experiment with variations.
- Activation Choices: Default to `(GELU, SiLU, Sigmoid)` or `(ReLU, SiLU, Sigmoid)`. Evaluate performance differences.
- Flow Scale/Shift: Typically keep the scale in `[0, 1]` via Sigmoid. If negative scaling is desired, one might shift the output or use a different bounding strategy.
- Linear Layers: Use standard initializers (Xavier, Kaiming).
- Bias: Initialize the bias in `flow_scale` to a small negative value so that the initial scale starts just below 0.5, preventing extremes at startup (see the sketch after this list).
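A minimal sketch of these initialization choices applied to the module above (the exact bias value is an assumption; sigmoid(-0.1) ≈ 0.475, so the scale starts near 0.5):

```python
ffn = HoloGateFlow(512, d1=170, d2=170, d3=172)

# Xavier initialization for the linear transforms
for lin in (ffn.W1, ffn.W2, ffn.W3, ffn.W_out):
    nn.init.xavier_uniform_(lin.weight)
    nn.init.zeros_(lin.bias)

# Start the flow-affine step close to neutral
nn.init.zeros_(ffn.flow_shift.bias)
nn.init.constant_(ffn.flow_scale.bias, -0.1)  # initial scale ~= sigmoid(-0.1) ~= 0.475
```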
- Weight Decay: Standard in large models to reduce overfitting.
- Gradient Clipping: May be necessary if flow scale amplifies large values.
- Learning Rate: Adjust depending on overall model size; with more parameters, a smaller learning rate is sometimes helpful (a minimal training-step sketch follows this list).
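These fit into a standard training step; a minimal sketch using the module above with a dummy batch and a placeholder loss (the learning rate, weight decay, and clipping threshold are illustrative):

```python
model = HoloGateFlow(512, d1=170, d2=170, d3=172)
optimizer = torch.optim.AdamW(model.parameters(), lr=3e-4, weight_decay=0.01)

x = torch.randn(32, 512)                     # dummy batch
loss = model(x).pow(2).mean()                # placeholder loss, just to exercise backward()

optimizer.zero_grad()
loss.backward()
torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)  # guard against flow-amplified gradients
optimizer.step()
```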
- Compatibility: HoloGate-Flow can replace standard FFNs in any Transformer-based system with minimal changes to the code structure.
- Inference Overhead: Additional matrix multiplications (`W3`, `flow_scale`, `flow_shift`) can add latency. Evaluate trade-offs carefully for real-time applications (an integration sketch follows this list).
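A possible (hypothetical) post-norm Transformer block using HoloGateFlow in place of the MLP; since the module above expects `(batch, d_model)` inputs, the sequence is flattened before the call:

```python
class TransformerBlockWithHoloGate(nn.Module):
    def __init__(self, d_model, n_heads, d1, d2, d3):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.norm1 = nn.LayerNorm(d_model)
        self.ffn = HoloGateFlow(d_model, d1, d2, d3)
        self.norm2 = nn.LayerNorm(d_model)

    def forward(self, x):                      # x: (batch, seq, d_model)
        attn_out, _ = self.attn(x, x, x)
        x = self.norm1(x + attn_out)
        b, s, d = x.shape
        # HoloGateFlow already contains its own (flow-affine) residual path.
        x = self.ffn(x.reshape(b * s, d)).reshape(b, s, d)
        return self.norm2(x)
```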
RealNVP and Glow use 1x1 invertible convolutions, log-determinant computations, etc. While not strictly necessary in HoloGate-Flow, exploring invertible transformations might yield interesting “reversible” layers for Transformers.
One could stack multiple gating layers within a single FFN block, creating a deeper gating hierarchy. However, this may increase complexity and training time substantially.
While HoloGate-Flow is not guaranteed invertible, future research might explore fully invertible variants that preserve the normalizing flow property, possibly allowing more advanced generative or probabilistic interpretations.
If x1, x2, x3 represent different modalities (e.g., text, audio, vision), gating can selectively fuse them. The affine flow in the residual might help unify multi-modal embeddings in a flexible manner.
HoloGate-Flow introduces a novel way to enhance the Transformer feed-forward layer by combining multi-activation gating and a flow-based affine residual step. This design allows for richer non-linear interactions, context-aware gating, and preservation of the input signal even when gating is minimal.
Though the approach increases the parameter budget and computational cost, preliminary considerations suggest potential benefits in terms of expressivity, stability, and gradient flow. We anticipate that thorough experimentation and ablation will confirm (or challenge) these hypothesized advantages. Future work may explore deeper gating hierarchies, invertible transformations, and specialized domain applications (e.g., multi-modal tasks).
- Vaswani et al. (2017). Attention Is All You Need. NIPS. https://arxiv.org/abs/1706.03762
- Dauphin et al. (2017). Language Modeling with Gated Convolutional Networks. ICML. https://arxiv.org/abs/1612.08083
- Shazeer, Noam (2020). GLU Variants Improve Transformer. https://arxiv.org/abs/2002.05202
- Dinh et al. (2017). Density Estimation using Real NVP. ICLR. https://arxiv.org/abs/1605.08803
- Kingma & Dhariwal (2018). Glow: Generative Flow with Invertible 1x1 Convolutions. NeurIPS. https://arxiv.org/abs/1807.03039
- Ramachandran et al. (2017). Searching for Activation Functions. https://arxiv.org/abs/1710.05941
- PyTorch Documentation. https://pytorch.org/docs
Below is an alternative “lightweight” version if you wish to remove extra complexity:
```python
class HoloGateFlowLite(nn.Module):
    def __init__(self, d_model):
        super().__init__()
        # Single linear layer producing x1, x2, x3 in one shot
        self.fc1 = nn.Linear(d_model, d_model * 3)
        self.fc2 = nn.Linear(d_model * 2, d_model)
        self.norm = nn.LayerNorm(d_model * 2)
        self.flow_scale = nn.Linear(d_model * 2, d_model)
        self.flow_shift = nn.Linear(d_model * 2, d_model)

    def forward(self, x):
        combined = self.fc1(x)                 # shape: (batch, d_model * 3)
        d = combined.size(-1) // 3
        z1, z2, z3 = torch.split(combined, d, dim=-1)

        # Activations
        z1 = F.gelu(z1)
        z2 = F.silu(z2)

        gate = torch.sigmoid(z3)
        z_gated = gate * z2

        z_cat = torch.cat([z1, z_gated], dim=-1)
        z_final = self.fc2(z_cat)

        z_cat_norm = self.norm(z_cat)
        scale = torch.sigmoid(self.flow_scale(z_cat_norm))
        shift = self.flow_shift(z_cat_norm)

        return x + scale * z_final + shift
```
- Block Diagram:

```
x (input)
  ├─> split into x1 | x2 | x3
  │        │W1       │W2       │W3
  │        v         v         v
  │     z1=Act1   z2=Act2   z3=Act3
  │                   │         │
  │                   │      Sigmoid
  │                   └────*────┘
  │             z_gated = z2 * sig(z3)
  │
  │     z_cat   = concat(z1, z_gated)
  │     z_final = W_out(z_cat)
  │     scale   = Sigmoid(W_scale(LN(z_cat)))
  │     shift   = W_shift(LN(z_cat))
  │
  └─> out = x + scale * z_final + shift   (residual from x)
```
- Flow Affine Concept:

```
scale = Sigmoid( W_scale( LN(...) ) )
shift = W_shift( LN(...) )
out   = x + scale * z_final + shift
```
These diagrams illustrate the major components of HoloGate-Flow and how the gating plus flow-affine skip-connection work in tandem.