As Large Language Models scale toward trillions of parameters, compute cost and memory footprint have become the primary bottlenecks in AI hardware deployment. To combat this, the industry has shifted toward two complementary techniques: Mixture-of-Experts (MoE) architectures, which activate only a sparse subset of the network per token, and FP8 (8-bit floating point) quantization, which cuts memory traffic and tensor-core execution time.

However, combining ultra-low precision with sparse routing introduces severe optimization challenges, most notably an aggravated vanishing gradient problem.

The Fragility of Sparse Routing and Underflow

In a standard dense transformer, gradients flow back through uniform layers. In an MoE architecture, a gating (routing) network computes a probability distribution and dispatches each token to the top-$k$ most relevant experts. The routing decision relies on a soft gating function such as the following, where $W_g$ is the gating weight matrix and $\epsilon$ typically denotes a small noise term:

$$G(x) = \text{Softmax}(W_g \cdot x + \epsilon)$$
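The gating function above can be sketched in a few lines of plain Python. This is a minimal illustration, not a production router: the names `top_k_gate`, `w_g`, and `noise_std` are hypothetical, $\epsilon$ is modelled as Gaussian noise, and renormalizing the selected gate weights so they sum to 1 is an assumption that some implementations skip.

```python
import math
import random

def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]

def top_k_gate(x, w_g, k=2, noise_std=0.0, rng=None):
    """Return {expert_index: gate_weight} for the k highest-scoring experts.

    w_g holds one weight row per expert; noise_std controls the (assumed
    Gaussian) epsilon term from G(x) = Softmax(W_g . x + epsilon).
    """
    rng = rng or random.Random(0)
    logits = [
        sum(w * xi for w, xi in zip(row, x)) + rng.gauss(0.0, noise_std)
        for row in w_g
    ]
    probs = softmax(logits)
    top = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:k]
    norm = sum(probs[i] for i in top)  # assumption: renormalize over the top-k
    return {i: probs[i] / norm for i in top}

# Three experts, two input features (toy values)
w_g = [[1.0, 0.0], [0.0, 1.0], [-1.0, -1.0]]
gates = top_k_gate([2.0, 1.0], w_g, k=2)
print(gates)
```

With the toy weights shown, experts 0 and 1 are selected and expert 2 receives no token, which is the sparsity that the rest of this discussion depends on.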

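When transitioning from FP16 or BF16 to FP8 formats (such as E4M3 or E5M2), the range and precision of representable numbers shrink drastically. E4M3 has only 4 exponent bits and 3 mantissa bits; E5M2 keeps the 5 exponent bits of FP16 but retains just 2 mantissa bits. BF16, by comparison, has 8 exponent bits.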

The primary catalyst for vanishing gradients in this setup is numerical underflow. In FP8, the smallest representable non-zero normal value ($2^{-6}$ for E4M3, $2^{-14}$ for E5M2) is many orders of magnitude larger than in BF16 (about $1.2 \times 10^{-38}$). During backward propagation, the gradients flowing through the gating network and the experts with low gate weights are repeatedly multiplied by small fractional weights. Unless they are rescaled first, these micro-gradients fall below the smallest FP8 subnormal and are rounded to exactly zero.
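The underflow described above can be reproduced with a small software model of FP8 rounding. The sketch below is a simplification under stated assumptions: `quantize_fp8` is a hypothetical helper, it rounds to the nearest representable value, saturates at the format maximum, and ignores NaN and infinity encodings. The format constants follow the common E4M3 (bias 7, max 448) and E5M2 (bias 15, max 57344) definitions.

```python
import math

# (exponent bias, mantissa bits, largest finite value) for the two FP8 formats
FP8_FORMATS = {"E4M3": (7, 3, 448.0), "E5M2": (15, 2, 57344.0)}

def quantize_fp8(x, fmt="E4M3"):
    """Round x to the nearest value on the FP8 grid (simplified, saturating)."""
    bias, mant_bits, max_val = FP8_FORMATS[fmt]
    if x == 0.0:
        return 0.0
    mag = min(abs(x), max_val)
    # frexp gives mag = m * 2**e with 0.5 <= m < 1, so floor(log2(mag)) == e - 1
    exp = max(math.frexp(mag)[1] - 1, 1 - bias)  # clamp into the subnormal range
    step = 2.0 ** (exp - mant_bits)              # spacing between neighbouring values
    return math.copysign(round(mag / step) * step, x)

gate_weight = 0.02      # small routing probability (toy value)
upstream_grad = 0.01    # gradient arriving from the layers above (toy value)
micro_grad = gate_weight * upstream_grad  # 2e-4

for fmt in FP8_FORMATS:
    print(fmt, quantize_fp8(micro_grad, fmt))
```

In this toy case the product 2e-4 rounds to zero in E4M3, whose smallest subnormal is $2^{-9}$, but survives in E5M2 at roughly 2.1e-4. That difference is one reason E5M2 is often the format chosen for gradients.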

The MoE Gating Collapse

When gradients underflow to zero in an MoE layer, the backward pass fails to update the gating weights ($W_g$). This triggers a catastrophic failure mode known as gating collapse or expert starvation:

  • The router stops learning which expert is best suited for specific tokens.

  • A few initially dominant experts receive nearly all the tokens, overloading the devices that host them.

  • The remaining unselected experts receive zero gradient updates, rendering a vast portion of the sparse model completely useless ("dead experts").

Because the representation capacity of an MoE depends largely on expert specialization, vanishing gradients caused by FP8 underflow directly degrade the final downstream accuracy of the model.
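One way to spot this failure mode early is to track how tokens are spread across experts. The following is a minimal monitoring sketch; `expert_utilization`, `dead_experts`, and the routing trace are hypothetical names and toy data, not part of any particular framework.

```python
from collections import Counter

def expert_utilization(assignments, num_experts):
    """Fraction of routed tokens that each expert received."""
    total = len(assignments)
    if total == 0:
        return [0.0] * num_experts
    counts = Counter(assignments)
    return [counts.get(i, 0) / total for i in range(num_experts)]

def dead_experts(assignments, num_experts, threshold=0.0):
    """Indices of experts whose token share is at or below the threshold."""
    shares = expert_utilization(assignments, num_experts)
    return [i for i, share in enumerate(shares) if share <= threshold]

# Hypothetical routing trace: the expert chosen for each of 8 tokens
trace = [0, 0, 1, 0, 0, 1, 0, 0]
print(expert_utilization(trace, 4))
print(dead_experts(trace, 4))
```

Here experts 2 and 3 never receive a token, so they would be flagged as starved. In practice the threshold and the window over which tokens are counted are tuning choices.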

Mitigation Strategies for 2026 Architectures

To leverage FP8 speed without succumbing to vanishing gradients, cutting-edge training frameworks employ targeted architectural remedies:

  • Mixed-Precision Routing (Mixed-FP8): While the expert MLP layers (which consume the bulk of the FLOPs) are computed in FP8, the gating network and routing logic are strictly maintained in BF16 or FP16.

  • Stochastic Quantization and Dynamic Scaling: Frameworks apply per-tensor or per-block dynamic scaling factors. By continually rescaling each tensor according to the maximum absolute value of its gradients ($\max(|\nabla|)$), which effectively shifts the exponent range, the framework keeps gradient values within the narrow FP8 representable window.

  • Expert Regularization via Load Balancing: Injecting an auxiliary load-balancing loss computed in higher precision ensures that even if gradients temporarily vanish, the router is still pushed toward distributing tokens evenly.
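Two of the remedies above, dynamic scaling and the auxiliary loss, can be sketched in plain Python. Everything here is illustrative: `quantize_e4m3` is a simplified software model of E4M3 rounding (saturating, no NaN handling), `scaled_quantize` uses per-tensor scaling only, and `load_balance_loss` assumes a Switch-Transformer-style formulation ($N \sum_i f_i P_i$), which the text above does not prescribe.

```python
import math

E4M3_BIAS, E4M3_MANT_BITS, E4M3_MAX = 7, 3, 448.0

def quantize_e4m3(x):
    """Simplified round-to-nearest onto the E4M3 grid (saturating)."""
    if x == 0.0:
        return 0.0
    mag = min(abs(x), E4M3_MAX)
    exp = max(math.frexp(mag)[1] - 1, 1 - E4M3_BIAS)
    step = 2.0 ** (exp - E4M3_MANT_BITS)
    return math.copysign(round(mag / step) * step, x)

def scaled_quantize(grads):
    """Per-tensor dynamic scaling: stretch max|g| to the E4M3 maximum, then round."""
    amax = max(abs(g) for g in grads)
    scale = E4M3_MAX / amax if amax > 0 else 1.0
    return [quantize_e4m3(g * scale) for g in grads], scale

def load_balance_loss(router_probs, assignments, num_experts):
    """Assumed Switch-style auxiliary loss N * sum_i f_i * P_i, in full float precision."""
    tokens = len(assignments)
    loss = 0.0
    for i in range(num_experts):
        f_i = sum(1 for a in assignments if a == i) / tokens  # token share of expert i
        p_i = sum(p[i] for p in router_probs) / tokens        # mean router probability
        loss += f_i * p_i
    return num_experts * loss

grads = [3e-5, -1e-5, 2e-6]                    # toy gradient tensor
naive = [quantize_e4m3(g) for g in grads]      # unscaled: every entry underflows
quantized, scale = scaled_quantize(grads)
restored = [q / scale for q in quantized]      # dequantize in higher precision

balanced = load_balance_loss([[0.5, 0.5]] * 4, [0, 1, 0, 1], 2)
collapsed = load_balance_loss([[0.9, 0.1]] * 4, [0, 0, 0, 0], 2)
print(naive, restored, balanced, collapsed)
```

Without scaling, all three toy gradients round to zero; with scaling they survive quantization with some rounding error. The auxiliary loss is 1.0 for the perfectly balanced toy routing and larger for the collapsed one, so minimizing it pushes the router back toward an even split. Real training code would weight this loss by a small coefficient and compute it on framework tensors rather than Python lists.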

Conclusion

While FP8 is essential for the economic sustainability of frontier AI models, its adoption within Mixture-of-Experts architectures requires surgical precision. Without proactive dynamic scaling and isolated high-precision routing, the narrow numerical range of 8-bit floats will reliably choke gradient flow, turning a highly sophisticated sparse brain into an underperforming, rigid network.