From RAGs to riches: How LLMs can be made more reliable for knowledge-intensive tasks

In this article, we explain how RAG makes LLMs more reliable, efficient, trustworthy and flexible, by diving into the various components of its architecture.


Matthieu Boussard, PhD
Head of R&D at Craft AI





Large Language Models (LLMs) are formidable tools, capable of providing convincing answers to almost any question. Unfortunately, this ability to convince is a double-edged sword: under the guise of seemingly logical reasoning, the stated facts may be completely false. This can happen because the relevant documents are absent from the training corpus, or because they are too recent, too rare, or confidential.

What's more, in cases where the LLM's user bears responsibility for the answer, it is necessary to be able to cite the sources used to produce it. This is the case with contractual clauses, for example, where the answer matters less than how it was arrived at.

These use cases are grouped together under the term "knowledge-intensive applications". Here, the main aim is to access knowledge, as opposed to, say, translation or reformulation applications. We want to harness the comprehension and synthesis power of LLMs while guaranteeing that a controlled knowledge base is used, all while citing its sources. This best of both worlds exists, and it is based on RAG: Retrieval-Augmented Generation.

A RAG is an LLM-based system for fetching information from a user-controlled corpus and then synthesizing a response from selected elements.

Main components

A RAG is a multi-component architecture. Before presenting the complete architecture, let's zoom in on the main parts.


The first concept behind a RAG is embedding. A computer cannot directly manipulate words or phrases; they must first be transformed into numerical vectors on which computations can be performed. It is important that this transformation preserves semantic proximity, i.e. that two concepts that are semantically close are also numerically close.

Fortunately, there are pre-trained models for this task, such as BERT-derived sentence-embedding models. This operation can be quite computationally intensive, even though these models are smaller than LLMs. Recently, models have even been proposed specifically to improve inference times.
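To make the idea concrete, here is a minimal sketch in plain Python. The bag-of-words `embed` function and its tiny `VOCAB` are toy stand-ins for a real pretrained sentence-embedding model; only the cosine-similarity logic carries over to a real system.

```python
import math
from collections import Counter

# Toy stand-in for a real sentence-embedding model (e.g. a BERT-derived
# encoder): a bag-of-words vector over a fixed vocabulary. A real model
# would return a dense, learned vector instead.
VOCAB = ["cat", "dog", "pet", "car", "engine", "road"]

def embed(sentence: str) -> list[float]:
    counts = Counter(sentence.lower().split())
    return [float(counts[w]) for w in VOCAB]

def cosine_similarity(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb) if na and nb else 0.0

q = embed("my cat is a pet")
close = embed("a dog is a pet")
far = embed("the car engine on the road")
# The cat and dog sentences share "pet", so they score higher
# together than either does with the car sentence.
print(cosine_similarity(q, close) > cosine_similarity(q, far))  # True
```

The key property is the one stated above: semantically related sentences end up numerically close, so similarity can be computed with simple vector arithmetic.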

Document Chunking

The corpus containing the knowledge to be exploited by the RAG cannot be used directly. It has to be broken down into small chunks (with potential overlap). As we have seen, these chunks cannot be used directly and must be transformed via embedding. In addition, it's important to keep the metadata around the chunks, e.g. the source file from which it's taken, the chapter within the document, the date of last update, etc.
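A minimal chunking routine might look like the following sketch; the `chunk_size`, `overlap` and the `handbook.pdf` source name are illustrative assumptions, not prescriptions.

```python
def chunk_text(text: str, chunk_size: int = 200, overlap: int = 50) -> list[dict]:
    """Split a document into fixed-size character chunks with overlap,
    keeping metadata (source file, position) alongside each chunk."""
    chunks = []
    step = chunk_size - overlap
    for start in range(0, len(text), step):
        chunks.append({
            "text": text[start:start + chunk_size],
            "source": "handbook.pdf",  # hypothetical source file
            "start_char": start,
        })
        if start + chunk_size >= len(text):
            break
    return chunks

doc = "x" * 500
chunks = chunk_text(doc, chunk_size=200, overlap=50)
print(len(chunks))  # 3 chunks: [0:200], [150:350], [300:500]
```

Each chunk would then be passed through the embedding model, with the metadata stored alongside the resulting vector.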

The corpus can be large, and storing and retrieving these chunks efficiently poses new problems of its own.

Specialized information systems: Vector databases

To answer a given question, a RAG calculates an embedding of the question and searches for relevant chunks. Since embeddings preserve the notion of semantic distance, finding relevant documents is in fact equivalent to finding documents that are close, in terms of distance, in the embedding space. We can therefore formalize the problem of finding relevant documents as "find the k closest chunks in the embedding space". This operation must be inexpensive, even with a large corpus. This is where vector databases come in.
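Conceptually, the search is just a k-nearest-neighbour query in the embedding space. A brute-force sketch with NumPy shows the idea; real vector databases replace this linear scan with approximate indexes (such as HNSW) to keep the query cheap at scale.

```python
import numpy as np

def top_k_chunks(query_emb: np.ndarray, chunk_embs: np.ndarray, k: int) -> np.ndarray:
    """Return the indices of the k chunks closest to the query embedding,
    using cosine similarity. Brute force: fine for small corpora."""
    # Normalize so the dot product equals cosine similarity.
    q = query_emb / np.linalg.norm(query_emb)
    c = chunk_embs / np.linalg.norm(chunk_embs, axis=1, keepdims=True)
    sims = c @ q
    return np.argsort(-sims)[:k]

# Three toy 2-d chunk embeddings; the query points almost the same way
# as the first two, so they are the two nearest neighbours.
chunk_embs = np.array([[1.0, 0.0], [0.9, 0.1], [0.0, 1.0]])
query = np.array([1.0, 0.05])
print(top_k_chunks(query, chunk_embs, k=2))  # [0 1]
```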

To solve these problems, it is no longer possible to use relational databases such as MySQL, or NoSQL databases such as Redis. The way information is stored in these databases is not suited to the kinds of queries a RAG makes.

Fortunately, there are databases designed specifically for certain tasks, such as TimescaleDB for time series, or PostGIS for geographic data. Here, it's the vector DBs that will solve our problem. They store embeddings in an optimized way, so that it's possible to find the k vectors closest to a given vector. These include players such as ChromaDB, Qdrant and pgvector.

At the end of this stage, the RAG has retrieved from the database the k chunks most relevant to the question asked. These are then transferred to the LLM to provide the final answer.


The user's question and the retrieved chunks are assembled into a prompt, which is then supplied as input to an LLM, such as Llama 2, Falcon, etc., in the conventional way. It should be noted that generating a response from elements supplied in the prompt is a simpler task than generating everything from scratch. So, even with a "small" LLM (7B parameters), we already get very relevant results.
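A prompt-assembly step might look like the sketch below. The template wording is a free design choice, not a fixed convention, and the clause texts are invented for illustration.

```python
def build_prompt(question: str, chunks: list[str]) -> str:
    """Assemble the retrieved chunks and the user's question into a
    single prompt for the LLM. One plausible layout among many."""
    context = "\n\n".join(f"[{i + 1}] {c}" for i, c in enumerate(chunks))
    return (
        "Answer the question using only the context below. "
        "Cite the numbered passages you rely on.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {question}\nAnswer:"
    )

prompt = build_prompt(
    "What is the notice period?",
    ["The notice period is 30 days.", "Payment is due within 15 days."],
)
print(prompt)
```

Numbering the passages also makes it easy for the model to cite them, which matters for the traceability use cases discussed earlier.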

RAG architecture

With the preceding elements, we can present the complete architecture of a RAG. First, the vector DB is populated with the embeddings and metadata of the chunks from the document base. When a query arrives, its embedding is calculated, the k most relevant chunks are extracted from the vector DB, and the query and chunks are combined to create a prompt. This prompt is transmitted to an LLM, which provides the answer.
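The whole flow can be sketched end to end with toy stand-ins: a word-count "embedder" instead of a real model, an in-memory list instead of a vector DB, and the final LLM call left out. The `contract.pdf` source and clause texts are hypothetical.

```python
import math
from collections import Counter

def embed(text: str) -> Counter:
    # Toy embedding: word counts stand in for a real embedding model.
    return Counter(text.lower().split())

def similarity(a: Counter, b: Counter) -> float:
    dot = sum(a[w] * b[w] for w in a)
    norms = (math.sqrt(sum(v * v for v in a.values())) *
             math.sqrt(sum(v * v for v in b.values())))
    return dot / norms if norms else 0.0

# 1. Populate the "vector DB" with chunk embeddings and metadata.
corpus = [
    {"text": "The notice period is 30 days.", "source": "contract.pdf"},
    {"text": "Invoices are paid within 15 days.", "source": "contract.pdf"},
]
for chunk in corpus:
    chunk["emb"] = embed(chunk["text"])

# 2. Embed the query and retrieve the k most similar chunks.
query = "How long is the notice period?"
q_emb = embed(query)
hits = sorted(corpus, key=lambda c: similarity(q_emb, c["emb"]), reverse=True)[:1]

# 3. Build the prompt; a real system would now send it to the LLM.
context = "\n".join(f"({h['source']}) {h['text']}" for h in hits)
prompt = f"Context:\n{context}\n\nQuestion: {query}\nAnswer:"
print(hits[0]["text"])  # "The notice period is 30 days."
```

Swapping each toy component for a real one (a sentence-embedding model, a vector DB, an LLM) yields exactly the architecture described above.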

This response can also be enhanced with the sources used, thanks to the metadata associated with the chunks.

Technically, this general architecture is a classic one, and we have identified the various building blocks required to construct it. In practice, Python libraries such as LangChain or LlamaIndex make it possible to efficiently select and combine the LLM, the embedding model and the vector DB, as well as their parameters.


The RAG architecture therefore offers many advantages over using an LLM alone. These include:

  • Reliability: by using a user-controlled knowledge base, the probability of hallucination is greatly reduced.
  • Trust: the sources used for generation are cited, so a user with a high level of responsibility can refer directly to them.
  • Efficiency: the LLM used to generate the response can be much smaller than GPT-4 while still producing relevant results, and the architecture avoids having to fine-tune a model on the corpus. What's more, even with a small number of documents, it is possible to build a relevant RAG.
  • Flexibility: modifying the knowledge base is as simple as adding documents to the vector DB.

This architecture is already very effective with generic building blocks. Its performance can be further enhanced by fine-tuning: the LLM can be re-trained on a question-document-answer dataset, which can for example be generated by a larger model (such as GPT-4), thereby improving the quality of the generated answers.

It is also possible to modify this architecture to incorporate the ability to use tools; such systems are referred to as "agents". In this context, before directly answering the question posed, an LLM can choose to call on a tool, such as an online search, an API, a calculator, etc. The LLM can thus itself choose the query to run against the vector DB, and can combine the RAG with other tools such as online search.

While RAGs offer many advantages, they remain machine learning systems. Their performance needs to be carefully measured, and their use constantly monitored, not least because of the dynamic nature of the knowledge base. The metrics to be observed are the subject of in-depth study, but will include, for example, performance metrics on the quality of results, as well as safety metrics such as those relating to toxicity.

Want to find out more about how to deploy a domain-specific LLM, and run a RAG test with your own data? Make an appointment here.
