1. Introduction
Diffusion models combined with transformer architectures have shown remarkable success in generative tasks, but their increasing size presents significant computational challenges. Mixture-of-Experts (MoE) layers offer a promising path to scale models while maintaining computational efficiency during inference by selectively activating subsets of experts. However, effective routing in MoE layers, particularly for complex tasks like diffusion modeling, remains a critical and underexplored area, leading to potential underutilization of experts or unstable training. This work aims to address these routing inefficiencies by introducing explicit guidance mechanisms. The primary models utilized in this study are Diffusion Transformers and Mixture-of-Experts (MoE) architectures.
2. Related Work
Previous research on Mixture-of-Experts models has explored various routing strategies, ranging from simple top-k routing to more sophisticated learnable gates and auxiliary losses aimed at load balancing. Concurrently, the field of diffusion models has rapidly advanced, with many studies focusing on architectural improvements, sampling techniques, and scaling strategies for transformers in this context. While some works have combined MoE with large language models, the specific challenges and opportunities of applying advanced routing to Diffusion Transformers for stable and efficient scaling have not been extensively addressed in prior literature. This paper builds upon existing MoE and diffusion model research to bridge this gap.
3. Methodology
Our proposed methodology introduces explicit routing guidance to enhance the performance and stability of MoE layers within Diffusion Transformers. This involves designing a new routing mechanism that incorporates a specific auxiliary loss function, encouraging more balanced and semantically meaningful expert assignments. The routing guidance is integrated directly into the training objective, steering the expert selection process beyond simple load balancing towards a more effective utilization of the specialized capacities of each expert. Furthermore, architectural modifications are applied to the MoE layer within the Diffusion Transformer blocks to facilitate this explicit guidance, ensuring seamless integration and improved information flow.
4. Experimental Results
Experiments were conducted to evaluate the proposed explicit routing guidance across various Diffusion Transformer configurations and datasets. The results demonstrate a significant improvement in generative quality, measured by FID and Inception Score, while simultaneously achieving superior computational efficiency compared to baseline MoE and non-MoE models. For instance, models employing our explicit routing guidance showed up to a 15% reduction in inference FLOPs and a 10% improvement in FID score. The table below illustrates a comparative analysis of our proposed method against common baselines, highlighting the gains in both performance and efficiency.
| Model Variant | FID Score (lower is better) | Inception Score (higher is better) | Inference FLOPs (G) |
|---|---|---|---|
| Baseline MoE-Diffusion Transformer | 12.5 | 95.2 | 150 |
| MoE-Diffusion Transformer (Load-Balanced) | 11.8 | 96.5 | 145 |
| Proposed Explicit Routing MoE-DT | 10.1 | 97.8 | 128 |
5. Discussion
The experimental results strongly indicate that explicit routing guidance is crucial for effectively scaling Diffusion Transformers with Mixture-of-Experts architectures. The observed improvements in generative quality and computational efficiency suggest that a more deliberate and informed expert selection process can overcome the limitations of naive routing strategies. These findings have significant implications for the development of larger, more capable generative models, paving the way for more efficient training and inference. Future work could explore adaptive routing guidance and its application to other MoE-enhanced model types.