Mixture of Experts (MoE)
Mixture of Experts, MoE, expert model
Mixture of Experts is an architecture in which each token is routed to only a small, selected subset of specialized sub-networks (experts). It lets you grow a model's total parameter count while keeping the compute cost per token well below that of a dense model with the same number of parameters.
- A router sends each token to just a few specialized experts, not through the whole network.
- The model has many parameters in total but activates only a fraction of them per token.
- It offers the capacity of a large model at a lower per-token inference cost than a dense model with the same total parameter count.
Mixture of Experts (MoE) is a variant of the transformer architecture in which, instead of one large network, the model uses many smaller, specialized sub-networks called experts. A lightweight component called the router decides which experts each token is sent to; usually only a few of the many available experts are activated. The rest are skipped, which sets an MoE model apart from a classic dense model, where every token passes through the entire network.
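The routing step can be illustrated with a minimal sketch. It assumes a common top-k scheme: the router produces one score per expert, the k highest-scoring experts are kept, and their outputs are combined using renormalized weights. The function names, the toy scalar "experts", and the scores are all hypothetical and chosen for illustration; real models route vectors inside each layer and differ in detail.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def route_token(router_scores, top_k=2):
    """Pick the top_k experts for one token and renormalize their weights.

    router_scores: one raw score per expert (illustrative values).
    Returns (expert_index, weight) pairs whose weights sum to 1.
    """
    probs = softmax(router_scores)
    # Expert indices sorted by probability, highest first, cut to top_k.
    chosen = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:top_k]
    kept = sum(probs[i] for i in chosen)
    return [(i, probs[i] / kept) for i in chosen]

def moe_layer(token, router_scores, experts, top_k=2):
    """Weighted sum of the outputs of the selected experts only."""
    output = 0.0
    for index, weight in route_token(router_scores, top_k):
        # Experts that were not selected are never called for this token.
        output += weight * experts[index](token)
    return output

# Hypothetical toy experts: each one is just a scalar function.
experts = [lambda x: x + 1, lambda x: x * 2, lambda x: x - 3, lambda x: x * x]
scores = [0.1, 2.0, -1.0, 1.5]  # made-up router scores for a single token

selected = route_token(scores, top_k=2)
result = moe_layer(3.0, scores, experts, top_k=2)
```

With these made-up scores, only experts 1 and 3 run for the token; the other two contribute no compute at all, which is the source of the savings.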
The key advantage comes from separating two numbers. The total parameter count of an MoE model can be very large, but only a fraction of those parameters is activated per token. As a result, the model gains the capacity of a large model while the cost and time of inference stay closer to those of a much smaller one. This is one of the main reasons many leading models of 2024–2026 use an MoE architecture.
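To make the two numbers concrete, here is a rough sketch of the arithmetic. It assumes a simplified model in which some parameters are shared by every token (attention, embeddings, the router) and the rest are split evenly across experts. All figures are hypothetical and do not describe any particular model.

```python
def moe_param_counts(shared_params, params_per_expert, num_experts, experts_per_token):
    """Return (total, active) parameter counts for a simplified MoE model.

    shared_params: parameters every token uses (assumed, illustrative).
    params_per_expert: parameters of one expert, summed over all layers.
    """
    total = shared_params + num_experts * params_per_expert
    active = shared_params + experts_per_token * params_per_expert
    return total, active

# Hypothetical configuration: 64 experts, 4 of them used per token.
total, active = moe_param_counts(
    shared_params=2_000_000_000,
    params_per_expert=1_000_000_000,
    num_experts=64,
    experts_per_token=4,
)
active_fraction = active / total

print(f"total parameters:  {total / 1e9:.0f}B")
print(f"active per token:  {active / 1e9:.0f}B")
print(f"active fraction:   {active_fraction:.1%}")
```

In this made-up configuration the model stores 66 billion parameters but uses only 6 billion of them for any given token, roughly 9% of the total.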
From a deployment perspective, it helps to understand the trade-offs. MoE lowers the compute cost per token, but all the experts must be kept in memory, so GPU memory demand can be high. Quality also depends on how well the router does its job: if tokens are sent to poorly suited experts, output quality suffers. For most companies these decisions are invisible, since they use the models through an API, but they affect the price and availability of a given model.
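The memory side of this trade-off can be estimated with a back-of-the-envelope sketch. It rests on two stated assumptions: weights are stored at 2 bytes per parameter (16-bit precision), and a forward pass costs roughly 2 floating-point operations per active parameter per token, a common rule of thumb that ignores attention over long contexts, caches, and activations. The model sizes are hypothetical.

```python
def weight_memory_gb(total_params, bytes_per_param=2):
    """Memory needed just to hold the weights, in GB (1 GB = 1e9 bytes)."""
    return total_params * bytes_per_param / 1e9

def forward_flops_per_token(active_params):
    """Rule-of-thumb forward-pass cost: about 2 FLOPs per active parameter."""
    return 2 * active_params

# Hypothetical MoE model: 66B parameters stored, 6B active per token.
moe_memory = weight_memory_gb(66_000_000_000)
moe_flops = forward_flops_per_token(6_000_000_000)

# Hypothetical dense model with the same per-token compute: 6B parameters.
dense_memory = weight_memory_gb(6_000_000_000)
dense_flops = forward_flops_per_token(6_000_000_000)

print(f"MoE:   {moe_memory:.0f} GB of weights, {moe_flops:.1e} FLOPs per token")
print(f"Dense: {dense_memory:.0f} GB of weights, {dense_flops:.1e} FLOPs per token")
```

Under these assumptions the two models do about the same work per token, yet the MoE model needs around eleven times more memory for its weights, because every expert has to stay loaded even when it is idle.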