Demystifying LLMs: A Dive into Generative Models

Exploring LLMs: Unveiling profound text generation via embeddings & attention, while navigating challenges in data use.


Matthieu Boussard, PhD
Head of R&D at Craft AI





Generative models are a family of machine learning models whose purpose, as their name suggests, is to generate data. This is quite different from their use in a predictive framework.

In a predictive framework, models are trained on past data in order to learn a decision boundary or a value. Generative models, by contrast, are trained to capture the structure of the data, with the aim of producing new data that resembles the training data.

Where a predictive model needs the values of the input attributes to give its prediction, a generative model needs context. This can take a variety of forms: excerpts of the content to be generated, constraints to be respected, or simply noise (which acts as a random "seed").

The previous definitions say nothing about the nature of the data these generative models are capable of producing. In fact, models have been developed to generate images, music, Excel files and so on. In the following, we'll focus on language models, and more specifically on LLMs (Large Language Models). Having a model capable of generating text is not really new, but what makes LLMs a real revolution compared with conventional approaches is their versatility and quality. This has been made possible by advances in computer science research, notably the invention of Transformer-type neural networks in 2017, and then the ability to learn from colossal volumes of data.

Large Language Model: LLM

While we see many very impressive applications of LLMs, it's important to remember that these models are designed for just one task: that of predicting the next element in a given sentence. This may seem surprising, but whether it's code generation, text summarization, medical diagnosis or a philosophy dissertation, it's all based on the ability to generate the next part of a given sentence. The model will complete the input sentence, then re-use this partially extended sentence as new input, and so on until the desired length is reached.
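This completion loop can be sketched in a few lines. The `BIGRAMS` table below is a purely hypothetical stand-in for the model: a real LLM replaces `predict_next` with a neural network that scores every word in its vocabulary, but the autoregressive control flow is the same.

```python
# Toy autoregressive generation loop. The "model" here is a hypothetical
# bigram table mapping a word to its most likely successor; a real LLM
# replaces predict_next with a neural network, but the loop is identical.
BIGRAMS = {
    "the": "cat",
    "cat": "sat",
    "sat": "on",
    "on": "the",
}

def predict_next(tokens):
    """Return the most likely next token given the sequence so far."""
    return BIGRAMS.get(tokens[-1], "<end>")

def generate(prompt, max_tokens=5):
    tokens = prompt.split()
    for _ in range(max_tokens):
        nxt = predict_next(tokens)
        if nxt == "<end>":
            break
        tokens.append(nxt)  # the extended sequence becomes the new input
    return " ".join(tokens)

print(generate("the", max_tokens=4))  # "the cat sat on the"
```

Each pass through the loop feeds the partially extended sentence back in as the new input, exactly as described above.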

Most recent LLMs are based on two main concepts: embeddings and the attention mechanism.


Computers cannot manipulate words and their semantics directly. They can, however, compute very efficiently on numbers. We could associate each word with a unique number, but this would lead to a very inefficient system. What we'd like is for the numerical values associated with words to retain information. For example, if the word "man" is at a certain distance from the word "woman", this distance should be the same between the words "king" and "queen". This is what embeddings enable. This is a trained part of the model which, for a given word, returns a computer-usable representation that preserves the relationships between words (and even their relative position in the sentence, for certain embedding types).

Embeddings: Translating to a Lower-Dimensional Space
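The man/woman vs. king/queen property can be sketched with hand-made 2-D vectors. These values are purely illustrative: real embeddings have hundreds of dimensions and are learned from data, not written by hand.

```python
import numpy as np

# Hand-made 2-D embeddings, purely illustrative: in a real model the
# vectors are high-dimensional and learned during training.
emb = {
    "man":   np.array([1.0, 0.0]),
    "woman": np.array([1.0, 1.0]),
    "king":  np.array([3.0, 0.0]),
    "queen": np.array([3.0, 1.0]),
}

# The offset "woman - man" matches the offset "queen - king":
offset_gender_1 = emb["woman"] - emb["man"]
offset_gender_2 = emb["queen"] - emb["king"]
print(np.allclose(offset_gender_1, offset_gender_2))  # True
```

In a trained embedding space, such directions (gender, plurality, tense, ...) emerge from the data rather than being designed in.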


An LLM, such as GPT, is trained to perform a single task: predicting the next word given the beginning of a sentence. A simple approach might be to group together all the words preceding the word to be predicted in order to create a context, and learn from a training corpus which next word is most likely. This approach generates a lot of noise, as many uninformative words are taken into account. This is where the attention mechanism comes in. It learns which elements are relevant to the prediction of the next word and focuses on those, masking the others. Thanks to a particular type of embedding (positional embedding), both position and semantics are taken into account. In this way, only relevant elements enter the decision-making process, enabling the model to learn efficiently how to predict the next word. The mechanism is easy to illustrate with images: a mask is learned, then applied to the image to isolate the important elements. Note that, as with text, several attention mechanisms ("heads") run in parallel, as it may be necessary to focus on several parts of the image in order to interpret it correctly.

MIT 6.S191 (2022): Recurrent Neural Networks and Transformers
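The attention step described above can be sketched as scaled dot-product attention, the formulation from the Transformer paper. The dimensions below are illustrative, and the random queries, keys and values stand in for what a real model computes from the token embeddings.

```python
import numpy as np

def softmax(x, axis=-1):
    """Numerically stable softmax."""
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def scaled_dot_product_attention(Q, K, V):
    """Attention(Q, K, V) = softmax(Q K^T / sqrt(d_k)) V."""
    d_k = K.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)  # how relevant each token is to each query
    weights = softmax(scores)        # the learned "mask": focus on relevant tokens
    return weights @ V, weights

rng = np.random.default_rng(0)
Q = rng.normal(size=(4, 8))  # 4 tokens, dimension 8 (illustrative)
K = rng.normal(size=(4, 8))
V = rng.normal(size=(4, 8))
out, w = scaled_dot_product_attention(Q, K, V)
# Each row of `w` sums to 1: a distribution over which tokens to attend to.
```

Multi-head attention simply runs several copies of this computation in parallel, each free to focus on a different part of the input.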

An LLM is a stack of modules

To understand the size of a language model, we need to look at a simpler example of deep learning.

Such a network is made up of an input layer, through which the input attribute values are fed; two hidden layers that learn correlations between the input values; and an output layer from which the results are read.

Each neuron (the circles) aggregates the values of the previous layer (the x's), each multiplied by an associated weight (w), and applies a function to this sum to produce a value that can then be used by the following layers of the network. It is these weights w that are learned during the training phase; they are also called parameters.
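A single neuron can be sketched as follows. The weights, inputs and bias are made-up numbers, and ReLU is used as the activation function purely for illustration:

```python
import numpy as np

def neuron(x, w, b):
    """Weighted sum of inputs followed by a non-linearity (here, ReLU)."""
    z = np.dot(w, x) + b  # sum of x_i * w_i, plus a bias term
    return max(0.0, z)    # activation: pass positive values, clip negatives

x = np.array([0.5, -1.0, 2.0])  # values coming from the previous layer
w = np.array([0.1,  0.4, 0.3])  # learned weights (the "parameters")
print(neuron(x, w, b=0.2))      # ≈ 0.45
```

A layer is just many such neurons sharing the same inputs, and training consists of adjusting all the `w` and `b` values.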

The way in which the neurons are assembled (their number, the links between them) is called an architecture. The attention mechanism is therefore an architecture. It's possible to build a slightly larger architecture containing an attention mechanism: the Transformer block. And if we "stack" Transformer blocks and add the embedding block, we almost get an LLM architecture like LLaMA! It's the number of "stacked" layers that defines the size of the model, expressed as a number of parameters.

Google Research: Attention Is All You Need
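A rough, back-of-the-envelope parameter count for such a stack, assuming a simplified Transformer layout (four attention projection matrices and two feed-forward matrices per block, ignoring biases, normalization and output layers). The dimensions used in the example are illustrative, in the ballpark of a 7B-class model:

```python
# Rough parameter count for a stack of Transformer blocks plus an
# embedding table. Simplified: biases, layer norms and the output
# head are ignored.
def transformer_params(d_model, n_layers, vocab_size, d_ff=None):
    d_ff = d_ff or 4 * d_model            # common feed-forward width
    attn = 4 * d_model * d_model          # Q, K, V and output projections
    ffn = 2 * d_model * d_ff              # two feed-forward matrices
    per_block = attn + ffn
    embedding = vocab_size * d_model
    return n_layers * per_block + embedding

# Illustrative dimensions in the range of a 7B-class model:
total = transformer_params(d_model=4096, n_layers=32, vocab_size=32000)
print(f"{total:,}")  # ≈ 6.6 billion parameters
```

Doubling the number of stacked blocks roughly doubles the parameter count, which is why model size is usually quoted this way.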


The training costs, data requirements and know-how needed to create a foundation model are such that it is out of reach for a non-specialist company. Nevertheless, while the largest models are still only accessible via APIs (such as GPT-4 or Bard), more and more models are appearing in open source. These models come in a range of sizes (from a few billion to several hundred billion parameters), so the choice of LLM can be adapted to the specific use case. Indeed, putting an oversized LLM into production generates excessive energy consumption and hosting constraints.

LLMs are capable of handling a wide variety of problems. When you're only interested in solving a specific business problem, you can adapt a smaller LLM (like LLaMA 2 7B, for example) to this use case. In this way, its performance on this specific use case can be made equivalent to that of a larger LLM. These techniques are known as fine-tuning. An important point about fine-tuning is that it can be carried out on standard hardware (a few GPUs), at a more reasonable cost and time. If you don't have sufficiently powerful hardware, you can also reduce the precision of the model's parameters so that they fit into the GPU's memory (a technique known as quantization). The open-source distribution of models and the development of a software ecosystem are now allowing LLMs to be democratized and adopted by industry.
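A minimal sketch of the idea behind quantization: mapping 32-bit floating-point weights to 8-bit integers plus a scale factor, trading a little precision for a 4x memory saving. Production schemes (such as those used to run LLaMA-class models) are more sophisticated, typically quantizing per block of weights.

```python
import numpy as np

def quantize_int8(w):
    """Map float32 weights to int8 plus a single scale factor."""
    scale = np.abs(w).max() / 127.0
    q = np.round(w / scale).astype(np.int8)
    return q, scale

def dequantize(q, scale):
    """Approximate reconstruction of the original weights."""
    return q.astype(np.float32) * scale

rng = np.random.default_rng(1)
w = rng.normal(size=1000).astype(np.float32)  # pretend model weights
q, scale = quantize_int8(w)
w_hat = dequantize(q, scale)

print(q.nbytes, w.nbytes)  # 1000 4000 -> 4x smaller in memory
# Rounding error is bounded by half the scale factor:
print(float(np.abs(w - w_hat).max()) <= scale / 2 + 1e-6)  # True
```

The model then computes with the dequantized (or integer) values, accepting a small loss of precision in exchange for fitting in GPU memory.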

A Survey of Large Language Models


Navigating through the complexities of Large Language Models (LLMs), we uncover the depth of their text generation capabilities, rooted in mechanisms like embeddings and attention. While their ability to produce diverse and coherent text showcases AI's leaps in technology, LLMs also present notable challenges in areas like data labeling, deployment, and ethical use. As we stride forward in the AI domain, marrying technological advancements with ethical and responsible application will be crucial. The implementation and fine-tuning of LLMs demand a conscientious approach, ensuring their benefits permeate various domains while safeguarding ethical and operational standards.
