Mixture of Experts (MoE)
Definition
An architectural pattern in which a language model consists of several specialised "expert" subnetworks, of which only a small subset is activated per token (sparse activation). This allows significantly higher total parameter counts at comparable inference compute cost. 2026 examples: the Mixtral family, the DeepSeek-V3 line, and several frontier models whose architecture isn't public but is presumed to be MoE.
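Under the hood the pattern is simple: a small router network scores the experts for each token, and only the top-k experts actually run. The following is a minimal, illustrative PyTorch sketch of such a layer; the dimensions, expert count and top-k value are invented for the example and do not correspond to any particular model.

```python
# Minimal sketch of sparse top-k expert routing (illustrative only, not any
# specific model's implementation). All sizes below are made-up examples.
import torch
import torch.nn as nn
import torch.nn.functional as F

class SparseMoELayer(nn.Module):
    def __init__(self, d_model=512, d_ff=2048, n_experts=8, top_k=2):
        super().__init__()
        self.top_k = top_k
        # Each "expert" is an ordinary feed-forward block.
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(d_model, d_ff), nn.GELU(), nn.Linear(d_ff, d_model))
            for _ in range(n_experts)
        )
        # The router scores every expert for every token.
        self.router = nn.Linear(d_model, n_experts)

    def forward(self, x):                                  # x: (n_tokens, d_model)
        scores = self.router(x)                            # (n_tokens, n_experts)
        weights, chosen = scores.topk(self.top_k, dim=-1)  # keep only the top-k experts per token
        weights = F.softmax(weights, dim=-1)               # renormalise over the chosen experts
        out = torch.zeros_like(x)
        for slot in range(self.top_k):
            for e, expert in enumerate(self.experts):
                mask = chosen[:, slot] == e                # tokens routed to expert e in this slot
                if mask.any():
                    out[mask] += weights[mask, slot:slot + 1] * expert(x[mask])
        return out

tokens = torch.randn(16, 512)
print(SparseMoELayer()(tokens).shape)                      # torch.Size([16, 512])
```

Note that although only two experts run per token, all eight sets of expert weights still have to sit in memory, which is where the trade-off discussed below comes from.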
Noise — Signal
MoE is sold as "the architecture that lets us scale efficiently". In fact it just shifts the scaling axis: less inference compute per token, but higher memory requirements (all experts must be held in memory), more complex routing, and hardware utilisation that is harder to keep high, especially in on-premises setups with limited GPU memory. From the buyer's perspective the relevant point is: at comparable quality, MoE models can be cheaper per token as long as the hosting provider absorbs the memory overhead; on your own hardware that assumption does not automatically hold.
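A rough back-of-the-envelope calculation makes the memory-versus-compute trade-off concrete. The Mixtral 8x7B figures below (roughly 47B total parameters, roughly 13B active per token with top-2 routing) are approximate public numbers; the 34B dense comparison point is an assumption chosen purely for illustration.

```python
# Sketch: what must sit in GPU memory vs. what is actually computed per token.
# Figures are approximate / assumed, not benchmarks.

def gib_for_weights(params_billion, bytes_per_param=2):   # fp16/bf16 weights
    return params_billion * 1e9 * bytes_per_param / 2**30

moe_total_b, moe_active_b = 46.7, 12.9   # ~Mixtral 8x7B: all experts loaded, top-2 active
dense_b = 34.0                           # assumed dense model of comparable quality

print(f"MoE weights in memory:   ~{gib_for_weights(moe_total_b):.0f} GiB")
print(f"MoE params used per token: ~{moe_active_b:.1f}B (what you pay in FLOPs)")
print(f"Dense weights in memory: ~{gib_for_weights(dense_b):.0f} GiB")
# The MoE model computes less per token but needs more GPU memory than the
# dense model, which is exactly the on-premises constraint described above.
```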
The right question
Not: "Should we deploy MoE models?" But: "What implications does the MoE architecture have for our hosting (GPU memory, utilisation), our latency requirements and availability on on-premises stacks compared to dense models of similar quality?"