Advanced RAG Techniques — The Self-RAG strategy
With the recent advances in Autonomous Agents within Generative AI, we have seen the rise of powerful techniques aimed at improving LLM response quality and at ensuring that we extract the most we can from LLMs in production-level applications. Although those two objectives may seem the same, that is not always the case. Imagine the following scenario: (1) you would like to use an LLM to build an agent that can perform Cypher queries (that is, fetch data from a graph database that supports the Cypher query language, such as Neo4j or Memgraph) and use the results to answer users' questions. At the same time, you also want your LLM to be able to (2) answer users' questions even when nothing needs to be queried from the graph database to provide the answer. Classical approaches, such as giving an agent access to graph queries and always executing a fixed, pre-established sequence of steps in the same order, could lead to an excellent answer for case (1), but could fail (and indeed cause application-level failures) for case (2). Additionally, in the second case we would spend a significant number of tokens just to build the prompt containing the graph schema (node and relationship information), generate the Cypher query, execute it, and obtain the answer. In the end, the Cypher query would fail anyway (we are supposing the answer to the user's simple question is not contained in the graph data), and we would have spent those tokens on a step that turned out to be useless in our chain.
In the field of RAG-powered applications, several advances have been made with the objective of improving model responses through advanced prompting strategies, as well as mitigating issues such as hallucinations and unnecessary reasoning steps. The combination of those techniques can lead to (1) faster responses, especially for simple questions that do not need advanced RAG approaches, and (2) more efficient chain executions that optimize cost and latency, all without decreasing answer quality (and sometimes even improving it!).
In this post (which is part of a series of posts on Advanced RAG techniques), I will introduce the approach called Self-RAG, presented for the first time, to the best of my knowledge, in the paper by Asai et al. (2023); see the References section of this post.
Self-RAG
The concepts and results presented in this section are taken from the original paper introducing the Self-RAG approach, Asai et al. (2023).
I will go ahead and borrow some of the words from the beginning of the paper cited above: "Indiscriminately retrieving and incorporating a fixed number of retrieved passages to answer user's questions, regardless of whether retrieval is necessary, or passages are relevant, diminishes LM versatility or can lead to unhelpful response generation."
The Self-RAG paper presents a new approach in which the LLM agent is able to:
- search for data from external sources using search tools like DuckDuckGo or Tavily (or even database retrieval tools),
- analyse the responses from the search step and evaluate which results are relevant to the user query,
- do a new search step if needed, in case the user’s question can only be partially answered from the first search step, and
- combine all the relevant search results to provide a final answer (a rough sketch of this loop is shown below).
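To make that flow more concrete, here is a minimal, hypothetical sketch of such a loop in Python. The helper functions (`needs_retrieval`, `search`, `is_relevant`, `generate_answer`) are toy placeholders standing in for the reflection-token decisions described later in this post; in a real system they would be LLM or tool calls, and none of them belong to any particular library.

```python
# Hypothetical sketch of a Self-RAG-style control loop.
# The helpers below are toy placeholders; in practice they would be LLM calls.

def needs_retrieval(question: str, answer: str, evidence: list[str]) -> bool:
    # Placeholder for the "Retrieve" reflection decision.
    return answer == ""  # toy rule: retrieve only while we have no answer yet

def search(question: str) -> list[str]:
    # Placeholder for a web-search or database retrieval tool.
    return [f"passage about: {question}"]

def is_relevant(question: str, passage: str) -> bool:
    # Placeholder for the "IsRel" relevance judgment.
    return question.split()[0].lower() in passage.lower()

def generate_answer(question: str, evidence: list[str]) -> str:
    # Placeholder for the LLM generation step.
    return f"Answer to '{question}' based on {len(evidence)} passage(s)."

def self_rag_answer(question: str, max_rounds: int = 3) -> str:
    evidence: list[str] = []   # relevant passages carried across rounds
    answer = ""
    for _ in range(max_rounds):
        if not needs_retrieval(question, answer, evidence):
            break                                   # no further grounding needed
        passages = search(question)                 # retrieval on demand
        evidence += [p for p in passages if is_relevant(question, p)]
        answer = generate_answer(question, evidence)
    return answer

print(self_rag_answer("Who created the Cypher query language?"))
```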
Figure 1 illustrates an example of how the algorithm works in practice.
We will summarize the concepts that are most critical to the reasoning, and refer the reader to the original paper for a more detailed explanation of the underlying concepts of the approach.
The most critical concepts introduced in Self-RAG, the ones that actually enable the decision of whether new retrievals are needed (or whether a given passage is relevant), are the reflection tokens. The reflection tokens are divided into four types, which we explain below (a small code sketch of these types follows the list):
- Retrieval-on-demand: This refers to the "Retrieve" boxes in Figure 1. This reflection token decides whether the continuation of the generation needs factual grounding (in other words, whether we need to search external sources again). The token also has a "continue to use evidence" value; if generated, it means that we carry the results of previous retrievals into the next steps. This is what allows the response in the example of Figure 1 to be a combination of steps 1 and 4: the answer from (1) gives a partial answer to the full question, while the answer from (4) provides further details that are needed for a complete answer to the user's question.
- Relevant: Represented by the "IsRel" boxes in Figure 1. This aspect indicates whether the evidence provides useful information (Relevant) or not (Irrelevant).
- Supported: Represented by the "IsSup" boxes in Figure 1. This aspect judges how much of the information in the output is entailed by the evidence. It is evaluated on a three-level scale: Fully supported, Partially supported, and No support / Contradictory. See Appendix A.1 of the original paper for more details and citations to other references therein.
- Useful: Represented by the "IsUse" boxes in Figure 1. It evaluates whether the response is a helpful and informative answer to the query, independently of whether it is in fact factual. Usefulness is scored from 1 to 5.
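As a quick illustration, the sketch below encodes the four reflection-token types and their possible values as plain Python enums. The value sets follow the descriptions above, but the class and member names are mine and purely illustrative; in the paper the tokens are special vocabulary items generated by the model itself, not a Python API.

```python
from enum import Enum

# Illustrative encoding of the four reflection-token types described above.

class Retrieve(Enum):          # retrieval-on-demand
    YES = "yes"
    NO = "no"
    CONTINUE = "continue to use evidence"

class IsRel(Enum):             # is the retrieved passage relevant to the query?
    RELEVANT = "relevant"
    IRRELEVANT = "irrelevant"

class IsSup(Enum):             # is the output entailed by the evidence?
    FULLY = "fully supported"
    PARTIALLY = "partially supported"
    NONE = "no support / contradictory"

class IsUse(Enum):             # how useful is the response? (1 = worst, 5 = best)
    ONE = 1
    TWO = 2
    THREE = 3
    FOUR = 4
    FIVE = 5
```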
In Appendix D of the original paper, we can clearly see how each of the reflection tokens appears in the full response-generation process. An interesting aspect of this setup is that, although we may need to iteratively trigger new search steps depending on the answer quality at a given point of the sequence generation (by sequence generation we mean everything that is generated after the user query is provided), some steps can be executed in parallel, while others can only be carried out once a given partial search response has been obtained and made available to the agent runtime.
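For example, once a batch of passages has been retrieved, grading each passage for relevance is independent of the others and can run concurrently, while the final generation step has to wait for those grades. The snippet below is a hypothetical illustration using only Python's standard library; `grade_relevance` is a placeholder for what would normally be an LLM call.

```python
from concurrent.futures import ThreadPoolExecutor

def grade_relevance(question: str, passage: str) -> bool:
    # Placeholder for an LLM-based "IsRel" judgment (an I/O-bound API call
    # in practice, which is why threads are a reasonable fit here).
    return "cypher" in passage.lower()

def grade_passages_in_parallel(question: str, passages: list[str]) -> list[str]:
    # Per-passage judgments are independent and can run concurrently;
    # the generation step must wait until all grades are available.
    with ThreadPoolExecutor(max_workers=8) as pool:
        grades = list(pool.map(lambda p: grade_relevance(question, p), passages))
    return [p for p, keep in zip(passages, grades) if keep]

passages = ["Cypher is a graph query language.", "Unrelated text about cooking."]
print(grade_passages_in_parallel("What is Cypher?", passages))
```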
Summary and Conclusions
The Self-RAG approach can be thought of as "RAG on demand", with the added capability of self-reflection. In addition to the usual LLM response, a Self-RAG model is trained to generate extra tokens called reflection tokens. Because these extra tokens are generated, one can tailor the LLM's behavior at test time by using them. As a result of this more advanced strategy, one can achieve better results on some tasks with smaller models than with plain prompting strategies on larger models.
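To give an idea of what tailoring the behavior at test time can look like, the sketch below combines hypothetical critique-token probabilities for a candidate segment into a single score using adjustable weights, in the spirit of the segment-level scoring discussed in the paper. The function, weights, and numbers here are illustrative assumptions, not the paper's implementation.

```python
# Illustrative test-time scoring: combine (hypothetical) critique-token
# probabilities for a candidate segment using user-chosen weights.
# Raising a weight makes the corresponding behaviour (e.g. evidence
# support) count more when ranking candidate continuations.

def segment_score(
    lm_logprob: float,                 # log-probability of the segment itself
    critique_probs: dict[str, float],  # e.g. {"IsRel": 0.9, "IsSup": 0.7, "IsUse": 0.8}
    weights: dict[str, float],         # test-time knobs, chosen by the user
) -> float:
    score = lm_logprob
    for token_type, prob in critique_probs.items():
        score += weights.get(token_type, 1.0) * prob
    return score

# Example: favour support by evidence over sheer usefulness/fluency.
weights = {"IsRel": 1.0, "IsSup": 2.0, "IsUse": 0.5}
candidate_a = segment_score(-1.2, {"IsRel": 0.9, "IsSup": 0.8, "IsUse": 0.6}, weights)
candidate_b = segment_score(-0.8, {"IsRel": 0.7, "IsSup": 0.3, "IsUse": 0.9}, weights)
print("pick A" if candidate_a > candidate_b else "pick B")
```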
Practical Implementations
As Self-RAG has already been around for some months (actually, almost a year at the time this post is being written), some implementations are available using LangGraph and LangChain. Take a look at the Mistral AI cookbook, the LangChain AI repository example, and the friendly guide to implementing Self-RAG using LangGraph, provided in this page. A minimal skeleton of such a graph is sketched below.
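To show the general shape of these implementations, here is a simplified LangGraph skeleton with stub nodes. It mirrors the structure used in the guides above (retrieve, grade, generate, plus a conditional re-query edge), but the node bodies are placeholders rather than real LLM or retriever calls, so treat it as a sketch (assuming `langgraph` is installed), not a complete Self-RAG implementation.

```python
from typing import TypedDict
from langgraph.graph import StateGraph, END

class RAGState(TypedDict):
    question: str
    documents: list[str]
    generation: str

# Stub nodes: each would normally wrap an LLM or retriever call.
def retrieve(state: RAGState) -> dict:
    return {"documents": [f"doc about {state['question']}"]}

def grade_documents(state: RAGState) -> dict:
    # Keep only documents judged relevant ("IsRel"); stubbed as keep-all.
    return {"documents": state["documents"]}

def generate(state: RAGState) -> dict:
    return {"generation": f"Answer based on {len(state['documents'])} document(s)."}

def transform_query(state: RAGState) -> dict:
    return {"question": state["question"] + " (rephrased)"}

def decide_to_generate(state: RAGState) -> str:
    # "Retrieve"-style decision: re-query if nothing relevant survived grading.
    return "generate" if state["documents"] else "transform_query"

builder = StateGraph(RAGState)
builder.add_node("retrieve", retrieve)
builder.add_node("grade_documents", grade_documents)
builder.add_node("generate", generate)
builder.add_node("transform_query", transform_query)

builder.set_entry_point("retrieve")
builder.add_edge("retrieve", "grade_documents")
builder.add_conditional_edges(
    "grade_documents", decide_to_generate,
    {"generate": "generate", "transform_query": "transform_query"},
)
builder.add_edge("transform_query", "retrieve")
builder.add_edge("generate", END)

app = builder.compile()
print(app.invoke({"question": "What is Self-RAG?", "documents": [], "generation": ""}))
```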
References
Asai et al. (2023). Self-RAG: Learning to Retrieve, Generate, and Critique through Self-Reflection. arXiv:2310.11511. https://arxiv.org/abs/2310.11511
Teaching Language Models to Support Answers with Verified Quotes. arXiv:2203.11147. https://arxiv.org/abs/2203.11147
WebGPT: Browser-Assisted Question-Answering with Human Feedback. arXiv:2112.09332. https://arxiv.org/abs/2112.09332
Automatic Evaluation of Attribution by Large Language Models. arXiv:2305.06311. https://arxiv.org/abs/2305.06311
Attributed Question Answering: Evaluation and Modeling for Attributed Large Language Models. arXiv:2212.08037. https://arxiv.org/abs/2212.08037