StructMoE: efficient MoE scaling
Introducing hierarchical routing and low-rank experts to enhance the efficiency and performance of MoE models. (NeurIPS)
The traditional approach to scaling Mixture of Experts (MoE) for transformer models has been to increase the total number of experts. Performance improves with more experts, but the gains are diminishing, whereas memory scales linearly with the number of experts. We introduce StructMoE, a scaling approach for MoE that augments experts with additional dynamic capacity using routed structured matrices, which we refer to as Low Rank Experts (LoRE). At a high level, we introduce hierarchical MoEs in which the first level of routing decides which expert each token is routed to and the second level decides which LoRE each token is routed through. The outputs of the expert and the LoRE are then entangled to produce the final output. This introduces more dynamism into the model, which has been shown empirically to improve model performance. We find that this scaling approach outperforms a standard MoE baseline in terms of loss on a held-out validation set. We therefore propose it as an effective scaling technique for MoEs, compared with the standard approach of adding more experts to the model.
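To make the two-level routing concrete, the following is a minimal sketch in plain Python, not the implementation evaluated here. Several details are assumptions made only for illustration, because the description above does not fix them: top-1 routing at both levels, a separate pool of LoREs and a separate second-level router per expert, LoREs applied to the token input as a rank-r down/up projection, and addition as the way the two outputs are entangled. All names (`ToyStructMoE`, `matvec`, `top1`, `rand_matrix`) are hypothetical.

```python
import random


def matvec(W, x):
    """Multiply matrix W (a list of rows) by vector x."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]


def rand_matrix(rows, cols, rng, scale=0.1):
    """Small random matrix, used as a stand-in for trained weights."""
    return [[rng.gauss(0.0, scale) for _ in range(cols)] for _ in range(rows)]


def top1(scores):
    """Index of the highest router score (top-1 routing, an assumption)."""
    return max(range(len(scores)), key=scores.__getitem__)


class ToyStructMoE:
    """Illustrative two-level MoE layer: expert routing, then LoRE routing."""

    def __init__(self, d_model=8, d_hidden=16, n_experts=4, n_lores=4,
                 rank=2, seed=0):
        rng = random.Random(seed)
        # Level 1: one router over experts.
        self.expert_router = rand_matrix(n_experts, d_model, rng)
        # Each expert is a two-layer feed-forward block (W_in, W_out).
        self.experts = [
            (rand_matrix(d_hidden, d_model, rng),
             rand_matrix(d_model, d_hidden, rng))
            for _ in range(n_experts)
        ]
        # Level 2 (assumption): every expert has its own LoRE router and pool.
        self.lore_routers = [
            rand_matrix(n_lores, d_model, rng) for _ in range(n_experts)
        ]
        # Each LoRE is a rank-`rank` factorisation: down (r x d), up (d x r).
        self.lores = [
            [(rand_matrix(rank, d_model, rng), rand_matrix(d_model, rank, rng))
             for _ in range(n_lores)]
            for _ in range(n_experts)
        ]

    def forward(self, x):
        """Route one token vector x; return (output, expert_id, lore_id)."""
        # First level: pick the expert for this token.
        e = top1(matvec(self.expert_router, x))
        # Second level: pick a LoRE from the chosen expert's pool.
        l = top1(matvec(self.lore_routers[e], x))

        # Expert path: linear, ReLU, linear.
        w_in, w_out = self.experts[e]
        hidden = [max(0.0, v) for v in matvec(w_in, x)]
        expert_out = matvec(w_out, hidden)

        # LoRE path: project down to `rank` dimensions, then back up.
        down, up = self.lores[e][l]
        lore_out = matvec(up, matvec(down, x))

        # Assumption: the two outputs are combined by simple addition.
        y = [a + b for a, b in zip(expert_out, lore_out)]
        return y, e, l


if __name__ == "__main__":
    layer = ToyStructMoE()
    token = [0.1 * i for i in range(8)]
    output, expert_id, lore_id = layer.forward(token)
    print("expert:", expert_id, "LoRE:", lore_id, "output dim:", len(output))
```

Under these assumptions, each LoRE stores 2 * d_model * rank weights, whereas each additional expert stores 2 * d_model * d_hidden, so a pool of LoREs with a small rank adds routed capacity at a fraction of the memory cost of a new expert.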