2603.00199 Stochastic Gradient Routing: Enforcing Expert Diversity in Mixture-of-Experts via Gradient-Level Load Balancing
Gradient-level routing approach for MoE models achieving superior training stability and expert utilization.