Advanced RAG Techniques — The Adaptive-RAG strategy

Gabriel Gomes, PhD
7 min read · Jul 16, 2024

In a previous post, we talked about the first Advanced RAG technique I wanted to cover in this series: Self-RAG. As we discussed, this approach is very useful for eliminating a (sometimes undesired) behavior of standard RAG systems, where the LLM uses a user query as input to fetch the most relevant documents from a document source / database. In a standard RAG system, this procedure is carried out regardless of the complexity and content of the user's query, and it is always exactly the same; the only variable is the retrieved document, which will be the one most semantically similar to the question. Self-RAG (or Self-Reflective RAG, as the authors of the original paper present the technique) goes one step further and dynamically decides whether fetching data from the document source is actually needed. Additionally, the framework reflects on the generated answer in light of the retrieved documents and checks whether the generated content contains hallucinations.
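
As a quick refresher, here is a minimal sketch of that Self-RAG control flow; llm and retrieve are simple placeholder stubs standing in for a real chat-completion call and a real vector search, not the code from the original paper:

# Minimal Self-RAG-style control flow (illustrative only).
# llm and retrieve are placeholder stubs, not the paper's implementation.

def llm(prompt: str) -> str:
    return "yes"                 # stand-in for a real LLM call

def retrieve(query: str) -> list[str]:
    return ["stub document"]     # stand-in for a real vector-store search

def self_rag(question: str) -> str:
    # Step 1: let the model decide whether retrieval is needed at all.
    needs_docs = "yes" in llm(
        f"Do we need external documents to answer: {question}? yes/no"
    ).lower()
    docs = retrieve(question) if needs_docs else []
    answer = llm(f"Question: {question}\nContext: {docs}\nAnswer:")
    # Step 2: reflect on the answer and flag unsupported (hallucinated) content.
    grounded = "yes" in llm(f"Is '{answer}' fully supported by {docs}? yes/no").lower()
    if grounded or not docs:
        return answer
    return llm(f"Answer '{question}' again, using only: {docs}")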

Although Self-RAG is already a powerful technique, let us think about a scenario for which Self-RAG alone might not be the best option. In a real-world application, users will most likely be much more creative than the testers and the test questions to which the GenAI engine was exposed at validation time. Questions can range from very simple queries to extremely complex ones that need to be broken down into smaller questions until a final answer is obtained. We will hereafter refer to simple questions that are straightforward to answer as single-hop questions, and to complex questions that require complex reasoning as multi-hop questions (this nomenclature is used extensively in papers about QA benchmarks and LLM response evaluation in general). With these two extremes in mind, suppose a given user asks two questions:

Question 1: Paris is the capital of which country?

Question 2: If you travel from Paris to Berlin and then from Berlin to Warsaw, which two seas will you be closest to during your journey?

While question 1 is a single-hop question, question 2 requires us to go through the following reasoning steps:

  • Identify the location of Paris, Berlin, and Warsaw on the map.
  • Determine the proximity of Paris to nearby seas: Paris is relatively close to the Atlantic Ocean.
  • Determine the proximity of Berlin to nearby seas: Berlin is close to the Baltic Sea.
  • Determine the proximity of Warsaw to nearby seas: Warsaw is also close to the Baltic Sea.
  • Assess the relative proximity to the seas during the journey from Paris to Berlin and Berlin to Warsaw.

We needed four steps just to gather the data in the first place, before doing a final comparison of distances and obtaining the result. In a state-of-the-art application where users can (and definitely will) ask questions whose complexity falls anywhere between that of questions 1 and 2, we cannot risk using the same LLM, prompt strategy, and set of tools / number of search queries to answer every question. If we adapted our prompt to always handle very complex questions like question 2, and the user asked a question like question 1, we would waste tokens and provide an unnecessarily large and specific prompt for a very simple task. On the other hand, we definitely won't get the right response to question 2 if we always use a simple prompt that can only answer question 1 (as a matter of fact, we wouldn't even need a custom prompt to answer question 1).

With this basic "adaptive" thinking in mind comes the need for tools and techniques like Adaptive-RAG (Jeong et al. 2024). In the next section, we will discuss, at a high level, how this technique was originally implemented by the authors.

The Adaptive-RAG Implementation

Jeong et al. define, in a very elegant way and in terms of mathematical functions, three scenarios: answering with the LLM alone, answering with a single retrieval step, and answering with iterative (multi-step) retrieval. In the most general scenario, we have a question q, and the answer comes from applying an LLM "operator" or function to q. Supposing we are dealing with a multi-hop question, the answer a would be given by:

a = LLM(q, d, c),   d in D

where q is the question, d is a document in the database of source documents that can be retrieved (namely, D), and c is the context, which, in the case of multi-hop questions, is formed by aggregating the contents of previously retrieved documents and the previous "sub-optimal answers" to the final question. Summarizing in one simple equation:

c = c(d1,d2,d3,…,a1,a2,a3,…)

At the end of the day, all we want is a procedure, sitting somewhere between q and a, that would allow us to classify the question complexity and pick the best strategy among the different possibilities. In the original work of Jeong et al., the authors train a model specifically to classify the input question complexity into three levels: A, B and C. Questions of complexity A are the simplest and don't even need the RAG cycle for an accurate answer; in other words, for A we would have a = LLM(q). For questions in class B, we would have a = LLM(q, d), and for questions of complexity C we would have a = LLM(q, d, c).
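
To make this routing concrete, here is a minimal sketch of the decision logic. This is not the authors' implementation (they train a dedicated classifier for the complexity label); the classification prompt below, as well as the llm and retrieve stubs (the same kind of placeholders used in the Self-RAG sketch above), are stand-ins only:

# Illustrative Adaptive-RAG routing, not the authors' implementation.

def llm(prompt: str) -> str:
    return "stub answer"                      # swap in a real LLM call

def retrieve(query: str) -> list[str]:
    return [f"stub document for: {query}"]    # swap in a real retriever

def classify_complexity(question: str) -> str:
    # The original paper trains a dedicated classifier for this step;
    # a prompted LLM is used here purely as a stand-in.
    label = llm(f"Classify the complexity of '{question}' as A, B or C. Reply with one letter.")
    label = label.strip()[:1].upper()
    return label if label in {"A", "B", "C"} else "C"   # default to the most thorough path

def adaptive_rag(question: str, max_hops: int = 3) -> str:
    level = classify_complexity(question)

    if level == "A":                          # a = LLM(q): no retrieval
        return llm(question)

    if level == "B":                          # a = LLM(q, d): single-step retrieval
        docs = retrieve(question)
        return llm(f"Question: {question}\nDocuments: {docs}\nAnswer:")

    # Level C: a = LLM(q, d, c), where the context c accumulates the
    # documents and intermediate answers of every previous hop.
    context: list[str] = []
    answer = ""
    for _ in range(max_hops):
        docs = retrieve(f"{question}\n{answer}")
        context.extend(docs)
        answer = llm(f"Question: {question}\nContext: {context}\nAnswer so far: {answer}\nRefine:")
        context.append(answer)
    return answer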

A Variation of Adaptive-RAG Applied to the Prompt Alone

In the section above, we described the implementation of Adaptive-RAG as in the original work of Jeong et al. and references therein. Now I'll briefly discuss a variation of this method for the specific case of a Graph RAG, in which both simple and complex questions are answered by querying a Neo4j graph database. Imagine a simple graph structure composed of the following Nodes (a minimal population sketch follows the list):

  • Person
  • Article
  • Year
  • Country
  • Affiliation
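
To make this structure concrete, here is a minimal sketch of how one record could be written to such a graph with the official neo4j Python driver. The relationship types (apart from AUTHORED, which also appears in the query further below), property keys, connection details, and sample values are just illustrative choices, not a fixed schema:

# Illustrative population of the graph described above. Relationship types,
# property keys, credentials and sample values are assumed for the example.
from neo4j import GraphDatabase

WRITE_QUERY = """
MERGE (p:Person {name: $author})
MERGE (a:Article {title: $title})
MERGE (y:Year {value: $year})
MERGE (c:Country {name: $country})
MERGE (af:Affiliation {name: $affiliation})
MERGE (p)-[:AUTHORED]->(a)
MERGE (a)-[:PUBLISHED_IN]->(y)
MERGE (p)-[:FROM_COUNTRY]->(c)
MERGE (p)-[:AFFILIATED_WITH]->(af)
"""

driver = GraphDatabase.driver("bolt://localhost:7687", auth=("neo4j", "password"))
with driver.session() as session:
    session.run(WRITE_QUERY, author="A. Author", title="An example paper title",
                year=2020, country="Some Country", affiliation="Some Institute")
driver.close()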

The graph structure above, when created in the right way (and provided we have the right data), can host data for doing analytics and searching papers in a given domain. I'll go ahead and use the domain of Astronomy as an example. Suppose we populated the graph with data on authors who published papers from 1990 up to 2024, and we also populated the database with the authors' countries of origin and their affiliations at the time they wrote the papers. Now suppose I ask the following question:

Who are the authors of the paper which title is “Discovery of a hot, transiting, Earth-sized planet and a second temperate, non-transiting planet around the M4 dwarf GJ 3473 (TOI-488)”?

The correct answer would be something like: “The authors are J. Kemmer, S. Stock, D. Kossakowski, among others”. Note that this is a very simple question. The query used to obtain this information from the Neo4j graph would look something like:

MATCH (a:Article {title: "Discovery of a hot, transiting, Earth-sized planet and a second temperate, non-transiting planet around the M4 dwarf GJ 3473 (TOI-488)"})<-[:AUTHORED]-(p:Person)
RETURN p.name AS Author
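
For completeness, here is how this query could be executed from Python with the official neo4j driver; the connection URI and credentials are placeholders for your own instance:

# Running the author query from Python with the official neo4j driver.
from neo4j import GraphDatabase

TITLE = ("Discovery of a hot, transiting, Earth-sized planet and a second "
         "temperate, non-transiting planet around the M4 dwarf GJ 3473 (TOI-488)")

CYPHER = """
MATCH (a:Article {title: $title})<-[:AUTHORED]-(p:Person)
RETURN p.name AS Author
"""

driver = GraphDatabase.driver("bolt://localhost:7687", auth=("neo4j", "password"))
with driver.session() as session:
    authors = [record["Author"] for record in session.run(CYPHER, title=TITLE)]
driver.close()

print(authors)   # e.g. ['J. Kemmer', 'S. Stock', 'D. Kossakowski', ...]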

While this very simple question leads to a straightforward graph query, we could also ask another question:

What is the summary of the results for papers published by J. Kemmer since 2018?

Note that to answer the question above, several other questions have to be answered first (or, in practice, we have to assume reasonable answers for them):

  • What is a summary, in this context?
  • Assuming we want a summary for several time windows, what is the window length? Is it a quarterly summary? Or a yearly summary?
  • What does the user want to know in the end? Is it the list of topics of the papers published by J. Kemmer? Or just a simple count?

The three points above are just some of the questions for which we (or, in practice, the LLM) would need to assume reasonable answers in order to generate a query that fetches the appropriate data in the appropriate format.

A solution to the challenge above is to implement an adaptive prompt for the Graph RAG. In this case, we could have one prompt used when the question is simple (like a straight "Who was the author" question) and another prompt for complex questions (summarization, multi-hop questions in general, among others). We could even have more granular prompt separations, such as one for time-window summarization questions, one for author-grouping summarization questions, and so on. We would then need to generate few-shot examples, and the actual few-shot examples used in the Cypher-generation prompt for a new question would be selected at runtime, when the user asks a question and we classify it according to its complexity or the domain to which it belongs (whether it's a summary-related question, a simple question, etc.).
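
Here is a minimal sketch of that runtime few-shot selection; the categories, example pools, prompts, and the llm stub are illustrative choices rather than a reference implementation:

# Route each question to the few-shot examples of its category before
# building the Cypher-generation prompt. Everything here is illustrative;
# the schema details in the "summary" example are assumed.

def llm(prompt: str) -> str:
    return "RETURN 1"        # stand-in for a real LLM call

FEW_SHOTS = {
    "simple": [
        "Q: Who wrote <title>?\n"
        "Cypher: MATCH (a:Article {title: $title})<-[:AUTHORED]-(p:Person) RETURN p.name",
    ],
    "summary": [
        "Q: Summarize <author>'s papers since <year>.\n"
        "Cypher: MATCH (p:Person {name: $author})-[:AUTHORED]->(a:Article)-[:PUBLISHED_IN]->(y:Year) "
        "WHERE y.value >= $year RETURN a.title, a.abstract",
    ],
}

def classify_question(question: str) -> str:
    label = llm(f"Classify this question as 'simple' or 'summary': {question}").strip().lower()
    return label if label in FEW_SHOTS else "simple"

def generate_cypher(question: str) -> str:
    # Only the few-shot examples of the predicted category go into the prompt.
    shots = "\n\n".join(FEW_SHOTS[classify_question(question)])
    prompt = (f"Generate a Cypher query for the question below.\n\n"
              f"Examples:\n{shots}\n\nQuestion: {question}\nCypher:")
    return llm(prompt)

print(generate_cypher("Who are the authors of the paper ...?"))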

Conclusion

In this post, we discussed the recently published Adaptive-RAG strategy and also a variation of it in which we use a language model to classify questions into topics (or complexity levels) and then fetch only the few-shot examples that are most aligned with the question asked by the user. In a forthcoming post, we will discuss a more practical implementation of this last variation of Adaptive-RAG.

References

Jeong et al. (2024). Adaptive-RAG: Learning to Adapt Retrieval-Augmented Large Language Models through Question Complexity. arXiv:2403.14403. https://arxiv.org/abs/2403.14403
