How AI Gets Bigger Without Getting Slower
There's a tension built into the scaling approach that has driven AI progress for the past several years.
Larger models are more capable. But larger models are also more expensive to run, because every token generated requires computation across every parameter in the model. Double the parameters, roughly double the inference cost. At the scale of frontier models, inference costs are already substantial. Scaling further with a dense architecture, one where every parameter participates in every forward pass, becomes increasingly difficult to justify economically.
Mixture of experts is the architectural innovation that lets you have it both ways, or at least closer to both ways than a dense model allows.
The core idea is conditional computation. Rather than activating all of a model's parameters for every token, a mixture of experts model activates only a subset. The model is divided into a large number of specialized subnetworks called experts, each of which handles different kinds of inputs or different aspects of computation. A separate component called a router examines each token and decides which experts should process it. Only the selected experts activate for that token. The rest sit idle.
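The idea can be sketched in a few lines of Python. This is a toy illustration, not how a real model is built: the experts are plain functions, the router is a hand-written rule rather than a learned network, and every name here (`make_expert`, `route`, `moe_forward`) is hypothetical. What it shows is that each token triggers work in only one of the four experts.

```python
from typing import Callable, List


def make_expert(scale: float, calls: List[int], idx: int) -> Callable[[float], float]:
    """Build a toy expert that scales its input and records that it ran."""
    def expert(x: float) -> float:
        calls[idx] += 1
        return scale * x
    return expert


def route(x: float, num_experts: int) -> int:
    """Toy router (an assumption, not a learned gate): bucket tokens by magnitude."""
    return min(int(abs(x)), num_experts - 1)


def moe_forward(tokens: List[float], experts: List[Callable[[float], float]]) -> List[float]:
    """Run each token through only the expert the router selects."""
    outputs = []
    for x in tokens:
        chosen = route(x, len(experts))
        outputs.append(experts[chosen](x))  # the other experts do no work for this token
    return outputs


num_experts = 4
calls = [0] * num_experts  # how many tokens each expert processed
experts = [make_expert(float(i + 1), calls, i) for i in range(num_experts)]

tokens = [0.5, 1.2, 3.7, 0.1, 2.9]  # stand-ins for token representations
outputs = moe_forward(tokens, experts)

# One expert call per token, even though four experts exist.
expert_calls_per_token = sum(calls) / len(tokens)
print(calls, expert_calls_per_token)
```

In a dense layer the equivalent count would be four calls per token. Here it is one, which is the whole point of conditional computation.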
The result is a model that has many more total parameters than a dense model of equivalent inference cost, because most of those parameters aren't being used for any given token. A model might have hundreds of billions of parameters in its expert layers but only activate tens of billions for each token. The total parameter count determines what the model can potentially know and represent. The activated parameter count determines what each forward pass actually costs.
This distinction between total parameters and active parameters is the key to understanding why mixture of experts is interesting. A mixture of experts model with 100 billion total parameters and 20 billion active parameters per token has representational capacity closer to that of a very large model, at an inference compute cost closer to that of a medium-sized one. The tradeoff isn't free: the routing mechanism adds overhead, and every expert has to be held in memory even when idle, so the memory footprint tracks the total parameter count rather than the active one. But it's substantially better than the tradeoff a dense architecture offers.
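The arithmetic behind the 100 billion / 20 billion example can be made concrete. The sketch below is a deliberate simplification: it lumps all non-expert weights (attention, embeddings, and so on) into one `shared_params` number and treats the experts as a single pool, ignoring per-layer structure. The specific configuration (18 experts of 5 billion parameters, 2 active per token, 16-bit weights) is invented to reproduce the figures in the text and does not describe any real model.

```python
from typing import Tuple


def moe_param_counts(
    shared_params: int,
    params_per_expert: int,
    num_experts: int,
    experts_per_token: int,
) -> Tuple[int, int]:
    """Return (total, active) parameter counts for a simplified MoE model."""
    total = shared_params + num_experts * params_per_expert
    active = shared_params + experts_per_token * params_per_expert
    return total, active


# Hypothetical configuration chosen to match the 100B total / 20B active example.
total, active = moe_param_counts(
    shared_params=10_000_000_000,
    params_per_expert=5_000_000_000,
    num_experts=18,
    experts_per_token=2,
)

active_fraction = active / total  # share of the weights used for any one token

# Memory follows the total count, because idle experts are still stored.
bytes_per_param = 2  # assumption: 16-bit weights
weight_memory_gb = total * bytes_per_param / 1e9

print(total, active, active_fraction, weight_memory_gb)
```

Under these assumptions each token touches a fifth of the weights, while the hardware still has to hold all of them.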
The routing mechanism is where much of the interesting engineering lives. A common design is top-k routing, in which the router selects the k highest-scoring experts for each token; a simple router might use k = 2. The scores are produced by a learned gating network that takes the token's representation as input and produces a weight for each expert. The experts with the highest weights get activated. Load balancing is a significant practical concern: if the router consistently sends most tokens to a small number of experts, the other experts don't learn much and the architecture's benefits are reduced. Training mixture of experts models requires explicit mechanisms to encourage balanced routing, typically auxiliary loss terms that penalize the router for concentrating traffic on a subset of experts.
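Both pieces, the top-k selection and the balancing penalty, fit in a short sketch. Several things here are assumptions rather than a description of any particular system: the raw scores are hard-coded instead of coming from a learned gating network, the selected weights are renormalized to sum to 1 (one common choice among several), and the auxiliary loss follows the general form used in the Switch Transformer work (number of experts times the sum over experts of traffic fraction times mean router probability), extended here to k routing slots per token and with the usual scaling coefficient left out. The function names are hypothetical.

```python
import math
from typing import List, Tuple


def softmax(scores: List[float]) -> List[float]:
    """Numerically stable softmax over one token's expert scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    z = sum(exps)
    return [e / z for e in exps]


def top_k_route(scores: List[float], k: int) -> List[Tuple[int, float]]:
    """Pick the k highest-probability experts; renormalize their weights to sum to 1."""
    probs = softmax(scores)
    chosen = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:k]
    norm = sum(probs[i] for i in chosen)
    return [(i, probs[i] / norm) for i in chosen]


def load_balance_loss(all_scores: List[List[float]], k: int) -> float:
    """Auxiliary loss: num_experts * sum_i(frac_i * mean_prob_i).

    frac_i is the share of routing slots sent to expert i, and mean_prob_i is
    the average router probability for expert i across the batch. The value is
    about 1.0 for uniform routing and grows as traffic concentrates.
    """
    num_tokens = len(all_scores)
    num_experts = len(all_scores[0])
    frac = [0.0] * num_experts
    mean_prob = [0.0] * num_experts
    for scores in all_scores:
        for i, p in enumerate(softmax(scores)):
            mean_prob[i] += p / num_tokens
        for i, _ in top_k_route(scores, k):
            frac[i] += 1.0 / (num_tokens * k)
    return num_experts * sum(f * p for f, p in zip(frac, mean_prob))


# One token, four experts, top-2 routing.
routes = top_k_route([2.0, 1.0, 0.0, -1.0], k=2)

# Four tokens that each prefer a different expert versus four that all prefer expert 0.
balanced_scores = [[5.0 if j == i else 0.0 for j in range(4)] for i in range(4)]
collapsed_scores = [[5.0, 0.0, 0.0, 0.0] for _ in range(4)]
balanced = load_balance_loss(balanced_scores, k=1)
collapsed = load_balance_loss(collapsed_scores, k=1)
print(routes, balanced, collapsed)
```

The collapsed batch scores markedly higher than the balanced one, so adding this term to the training objective pushes the router toward spreading tokens across experts.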
Mixture of experts isn't a new idea. The basic architecture was proposed in a 1991 paper by Robert Jacobs, Michael Jordan, Steven Nowlan, and Geoffrey Hinton, long before modern deep learning. What changed is scale and the practical infrastructure needed to train and serve very large sparse models efficiently. Routing decisions need to be fast. Experts that aren't co-located on the same hardware create communication overhead when they're selected. The engineering of large-scale mixture of experts systems is considerably more complex than dense model serving, and getting it right requires hardware-aware design choices that don't arise in simpler architectures.
Several prominent models use mixture of experts architectures. Google's Switch Transformer demonstrated the approach at scale in 2021. Mistral's Mixtral models are mixture of experts architectures that have achieved strong performance at relatively modest active parameter counts. GPT-4 is widely believed, though not officially confirmed, to use a mixture of experts design. The pattern is consistent: mixture of experts allows frontier labs to build models with substantially more total capacity than a dense architecture of equivalent inference cost would provide.
For practitioners, the mixture of experts architecture has implications for how model capabilities should be interpreted. A model described as having a certain number of parameters may be a sparse model where only a fraction of those parameters activate per token, which affects how the model's computational cost and capability should be compared to dense models of nominally similar size. Parameter count comparisons between dense and sparse models are not apples-to-apples, and benchmarks that don't account for active parameter count can be misleading about what's actually being compared.
The deeper implication of mixture of experts for AI development is that the relationship between model size, inference cost, and capability is more flexible than the dense scaling paradigm suggested. Specialization within a model (different experts developing different competencies and being selectively activated for different inputs) is a way of increasing total capacity without proportionally increasing the cost of using that capacity. Whether this architectural approach continues to scale as well as dense models, or whether it introduces limitations that become apparent at sufficient scale, remains an active research question. For now, it represents one of the more important architectural ideas in frontier AI development.