MoE: faster & more powerful, the new kings of LLMs!

In this article, we introduce a new form of LLM architecture: Mixture Of Experts. MoEs make LLMs more efficient and less expensive, but we'll also explain why they're not the best choice for every use case.


Bastien Zimmermann
R&D Engineer





To keep abreast of the latest developments in artificial intelligence, we turned to LLM benchmarks to take the temperature and see which models are dominant. Starting with the best-known benchmark, the Open LLM Leaderboard, we immediately notice that the top 10 is invaded by a slew of models with "MoE" in their names!

Looking for confirmation of this trend, we turn to the Chatbot Arena benchmark, widely recognized as the reference for ranking the best overall language models. Here again, the top 10 is overrun with MoEs! Indeed, GPT-4 is suspected of being one, and it is accompanied by Mistral's model, Mixtral (the "Mix" prefix signaling its MoE architecture).

Faced with this invasion, it becomes essential to introduce what MoEs are and explain the performance gains that come from using them.

Mixture of Experts (MoE): expectations

On the face of it, mixtures of experts seem to be a fabulous solution that promises:

  • Better performance,
  • Faster (and therefore less expensive) model training,
  • Significantly lower inference costs.

But what is a MoE?

Let's start by explaining how it's possible to improve LLM performance. A machine learning practitioner who has witnessed the historic rise of deep learning would start by simply increasing the number of parameters in the model. Mixture of Experts is a more refined way of doing this.

Architecture of the mixture of Experts

What is an expert? It's simply a neural network with unique, independently learned weights.

The second building block of a mixture of experts is the router:

The router takes the tokens given to the model as input and redirects them to the most appropriate expert(s). This is an important detail, as each token (~word) can pass through a different expert. The results of the various experts are then aggregated and normalized to form the model output.

More formally, the router is a learned model ($G$) and the MoEs operation can be represented by the following equation (with $E_i$ the expert networks):

$$y = \sum_{i=1}^{n}{G(x)_iE_i(x)}$$
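As a toy illustration, this equation can be sketched in a few lines of NumPy. The single-matrix "experts" and randomly initialized router below are placeholders for illustration, not a real implementation:

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, n_experts = 4, 3

# Toy experts E_i: a single linear map each. In a real LLM, each expert
# is a full feed-forward block with its own independently learned weights.
experts = [rng.standard_normal((d_model, d_model)) for _ in range(n_experts)]

# Toy router G: a learned matrix producing one logit per expert.
W_g = rng.standard_normal((d_model, n_experts))

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def moe_forward(x):
    """y = sum_i G(x)_i * E_i(x), with a plain softmax gate G."""
    gate = softmax(x @ W_g)  # G(x): one weight per expert
    return sum(g * (W @ x) for g, W in zip(gate, experts))

y = moe_forward(rng.standard_normal(d_model))
```

Note that the gate weights sum to 1, so the output is a convex combination of the expert outputs.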

One possible gating function is a simple Softmax. Another example is Mixtral's gating function, which adds a $TopK$ operation to keep only the $K$ best experts for each token (with $SwiGLU_i$ the expert networks):

$$y = \sum_{i=0}^{n-1}{Softmax(TopK(x \cdot W_g))_i \cdot SwiGLU_i(x)}$$
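A minimal sketch of this Top-K gating, using made-up router logits rather than Mixtral's actual code:

```python
import numpy as np

def topk_softmax_gate(logits, k):
    """Softmax over the k largest router logits; other experts get weight 0."""
    gate = np.zeros_like(logits)
    top = np.argsort(logits)[-k:]  # indices of the k best experts
    z = logits[top] - logits[top].max()
    gate[top] = np.exp(z) / np.exp(z).sum()
    return gate

# Hypothetical router logits x . W_g for a model with 4 experts:
logits = np.array([2.0, -1.0, 0.5, 1.5])
gate = topk_softmax_gate(logits, k=2)
# Only the two best experts (indices 0 and 3) receive non-zero weight,
# and those weights sum to 1; experts 1 and 2 are skipped entirely.
```

Because the unselected experts get exactly zero weight, their networks never need to be evaluated for that token, which is where the inference savings come from.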

The case of transformers

An important detail we've omitted so far is that the MoEs we mentioned in the introduction are present in Transformer architectures, and this introduces some nuances.

The experts and their routing mechanism replace the feed-forward blocks of the Transformer architecture. The attention layers are shared across experts (for the less mathematically inclined, this explains why Mixtral 8x7B is actually a ~47B-parameter model and not a 56B one). It also means that the experts differ from one Transformer block to the next.
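The arithmetic can be sketched with round numbers. These figures are illustrative approximations in the right ballpark, not Mixtral's exact configuration:

```python
# Approximate, illustrative parameter counts (not Mixtral's exact config).
shared = 1.3e9       # attention layers, embeddings, norms: shared by all experts
per_expert = 5.7e9   # one expert's feed-forward parameters, over all blocks
n_experts, k = 8, 2  # 8 experts per block, top-2 routing

total_loaded = shared + n_experts * per_expert  # must fit in memory: ~47B
active_per_token = shared + k * per_expert      # actually used per token: ~13B
```

The naive 8 x 7B = 56B count double-counts the shared attention parameters; only the feed-forward blocks are replicated eight times. The same arithmetic shows why inference is cheap: each token activates roughly 13B of the ~47B parameters.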

MoEs: Why are they better?

  • More effective inference/pre-training

MoEs are a way to get extra parameters "for free". Indeed, the Top-K mechanism ensures that most experts sit idle for any given token, so at inference time only a fraction of the parameters is used. Which fraction is used is optimized by the router, which is itself trained.

  • Training is less expensive.

MoE training is more efficient. By routing each token through only a few experts, the amount of computation (and of gradient calculation) per token is greatly reduced. As a result, training reaches a given quality level faster, which saves both time and money!

Is there a specialization of experts?

According to common perception, an expert is a person who is competent in a particular field. Is this also true of our MoEs experts?

The experts are not thematic experts, as our intuition might suggest: no expert receives a significantly higher share of tokens than the others on any given topic of The Pile dataset. The only exception is "DM Mathematics", which seems to point to a syntactic rather than a semantic specialization.

Jiang et al. Mixtral of Experts.

That said, some experts do seem to specialize in processing certain tokens, such as punctuation or conjunctions. These specialization results remain the exception, however, and the examples in the following table were selected among many others that show no clear semantic or syntactic specialization.


The disadvantages of MoEs

In addition to their different nature, MoE models also behave differently from conventional models when it comes to fine-tuning: MoEs are more prone to overfitting. As a general rule, the fewer the experts, the easier the model is to fine-tune. There are exceptions, however, such as the TriviaQA dataset, on which fine-tuned MoEs excel.

Even if MoEs use only a fraction of their parameters per token, the model and all its experts must still be loaded into memory. Machines with a large amount of VRAM (video RAM) are therefore required to run MoEs.

This suggests that MoEs should be chosen when sufficient VRAM is available and high inference throughput is required. Otherwise, a dense model (as opposed to a MoE) is more appropriate.


Mixtures of Experts are a simple way of increasing the performance of LLMs, at a lower training cost and with a reduced inference cost. However, this architecture comes with its own challenges: such models require additional memory (VRAM) just to be loaded, and this is compounded by a loss of adaptability, since they are a priori more difficult to fine-tune.

In short, Mixtures of Experts represent a major step forward in performance, but require substantial investment in hardware and expertise to be fully exploited.

