Advanced RAG Techniques — The LLMCompiler Approach
In previous posts, I covered some of the main concepts behind three advanced RAG techniques: Corrective RAG, Adaptive-RAG, and Self-RAG. The main similarity among the three approaches is that one can define a base flow of action, and the LLM-powered Agent will go through the necessary steps and perform the necessary actions, not necessarily in the same order or following the same flow for two different questions. This kind of adaptive reasoning has paved the way for even more complex approaches.
Although sophisticated methods can already be built with the three advanced techniques above, most (if not all) of them depend on the sequential execution of tasks. In a simple flow where the Agent's function calls do not have high latency, such a sequential reasoning approach may be enough to solve the problem at hand. However, there are situations where tools and function calls do not need to be executed sequentially, but can run in parallel. Take the following question as an example: ‘What is the capital of France? And the capital of Germany?’. With a simple Reason-and-Act approach like ReAct, the flow would be: 1) find the capital of France using a tool like DuckDuckGo, for instance, 2) do the same for Germany, and 3) condense the answers and provide one final answer to the user.
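To make the latency argument concrete, here is a minimal sketch comparing a sequential, ReAct-style execution of the two lookups with a parallel one, using Python's asyncio. The `search` coroutine is an assumption made for the example: it only simulates a slow tool call and does not wrap a real search API.

```python
import asyncio

# Hypothetical search tool: in a real agent this would wrap an actual
# search API (e.g. DuckDuckGo); here it only simulates network latency.
async def search(query: str) -> str:
    await asyncio.sleep(1.0)  # stand-in for a slow tool call
    return f"answer for: {query!r}"

async def sequential() -> list[str]:
    # ReAct-style: one call after the other, so latencies add up (~2 s).
    return [await search("capital of France"),
            await search("capital of Germany")]

async def parallel() -> list[str]:
    # The two lookups are independent, so they can run concurrently (~1 s).
    return await asyncio.gather(search("capital of France"),
                                search("capital of Germany"))

if __name__ == "__main__":
    print(asyncio.run(parallel()))
```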
In cases like the one above, where some degree of parallelization is possible, the LLMCompiler approach is a great way to exploit it.
The Implementation
We will describe the implementation at a high level, without going into specific code logic. The official implementation from the original LLMCompiler paper is available in the SqueezeAILab GitHub repository, and an implementation in Python using LangGraph is also provided here. Let us now walk through the high-level logic of the algorithm.
As mentioned before, when some degree of parallelization is possible, a Plan-and-Act approach can identify which calls are independent and separate them so they can be executed in parallel. This is exactly what the first component of the LLMCompiler approach does. It is called the Function Calling Planner (see the second box in Figure 1 below), and it is responsible for generating the sequence of tasks to be executed along with any dependencies among them. The Planner automatically identifies the necessary tasks, their input arguments, and their inter-dependencies, generating a structure like a DAG (Directed Acyclic Graph) of task dependencies. To represent the dependency of one task on another, the Planner uses placeholders like $1, meaning that an input of the current task depends on the output of Task 1.
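As an illustration, the plan for the capitals question above could be represented with a structure along the lines of the sketch below. This is only a toy rendering of the idea, not the Planner's actual output format; the `Task` class and the `join` step are assumptions made for the example.

```python
from dataclasses import dataclass, field

# Toy representation of a planned task: a tool name, its arguments, and
# the ids of the tasks it depends on. Placeholders such as "$1" mark
# arguments that will later be filled with the output of task 1.
@dataclass
class Task:
    idx: int
    tool: str
    args: dict
    dependencies: list[int] = field(default_factory=list)

# Plan for: 'What is the capital of France? And the capital of Germany?'
plan = [
    Task(1, "search", {"query": "capital of France"}),
    Task(2, "search", {"query": "capital of Germany"}),
    # The final step condenses both answers, so it depends on tasks 1 and 2
    # and cannot start before they finish.
    Task(3, "join", {"answers": ["$1", "$2"]}, dependencies=[1, 2]),
]
```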
To make sure that the appropriate tasks are generated and their dependencies are tracked precisely, it is very important that each tool the Planner can access has a well-defined description, both semantically and in terms of the input and output structure of each tool.
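For instance, a description for the hypothetical `search` tool used above could look like the sketch below. The exact schema is an assumption here; in practice it depends on the framework (LangChain/LangGraph tool definitions, OpenAI function-calling schemas, and so on).

```python
# Hypothetical description for the "search" tool used in the plan above.
# A clear name, a precise description, and explicit input/output types
# give the Planner what it needs to generate correct calls and dependencies.
search_tool_spec = {
    "name": "search",
    "description": "Look up a short factual answer on the web.",
    "parameters": {
        "type": "object",
        "properties": {
            "query": {"type": "string", "description": "The search query."}
        },
        "required": ["query"],
    },
    "returns": "A short text snippet answering the query.",
}
```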
After the DAG of tasks is generated by the Function Calling Planner (hereafter the Planner), the Task Fetching Unit comes into play. This component (the third box in Figure 1) is responsible for dispatching tasks to the Executor as soon as they are ready for parallel execution. The Task Fetching Unit is also responsible for replacing the dependency placeholders generated by the Planner (like $1, $2, $3) with the actual values returned by the corresponding task executions.
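Continuing the toy plan from the previous sketch, the two responsibilities of the Task Fetching Unit, deciding which tasks are ready and filling in the $k placeholders, could be approximated as follows. Both helper functions are assumptions for this illustration, not the paper's implementation.

```python
def ready_tasks(plan: list, done: set) -> list:
    """Tasks whose dependencies are all finished and that have not run yet."""
    return [t for t in plan
            if t.idx not in done and all(d in done for d in t.dependencies)]

def resolve_placeholders(args: dict, results: dict) -> dict:
    """Replace '$k' placeholders with the actual output of task k."""
    def substitute(value):
        if isinstance(value, str) and value.startswith("$"):
            return results[int(value[1:])]
        if isinstance(value, list):
            return [substitute(v) for v in value]
        return value
    return {key: substitute(value) for key, value in args.items()}
```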
Finally, the Executor (fourth box in Figure 1) is the component responsible for asynchronously executing the tasks dispatched by the Task Fetching Unit. The Executor can run all the tasks passed to it in parallel only because the Task Fetching Unit has already verified that those tasks do not depend on one another. Since each task may need to re-use intermediate outcomes with its tool, each task execution unit has a dedicated memory allocated to it, as can be seen in the Executor boxes, with the word memory right next to the tools.
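Putting the previous pieces together, a heavily simplified executor loop could fetch the ready tasks, run them concurrently, and store each output in a results store that plays the role of the memory shown next to each tool in Figure 1. The `TOOLS` registry and the `join` tool are assumptions made for this sketch.

```python
import asyncio

async def join(answers: list[str]) -> str:
    # Toy "join" tool that condenses intermediate answers; in a real
    # system this would typically be another LLM call.
    return " and ".join(answers)

# Hypothetical registry mapping tool names to async callables
# ("search" was defined in the first sketch).
TOOLS = {"search": search, "join": join}

async def run_plan(plan: list) -> dict:
    """Run the plan wave by wave: fetch all ready tasks, execute them
    concurrently, record their outputs, and repeat until the DAG is done."""
    results, done = {}, set()
    while len(done) < len(plan):
        batch = ready_tasks(plan, done)
        outputs = await asyncio.gather(
            *(TOOLS[t.tool](**resolve_placeholders(t.args, results))
              for t in batch))
        for task, output in zip(batch, outputs):
            results[task.idx] = output
            done.add(task.idx)
    return results
```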
It is worth mentioning that the initial DAG assumes a given sequence of steps and input-output pairs. However, as results come in, it may be necessary to re-plan and adjust the tasks to obtain the right result. While an if-else approach could be sufficient for a simple DAG, it is too simplistic for complex questions. Thus, the LLMCompiler approach also allows communication between the intermediate outcomes of the Executor and the Planner. If replanning is necessary, the Planner generates a new DAG, which is fed again to the Task Fetching Unit, and this new structure is then used to provide parallel tasks for the Executor. The cycle repeats until a response to the question is obtained, or until a limit on the number of LLM calls is reached (a safeguard used in production-level applications to prevent infinite loops and excessive token usage).
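The outer plan, execute, and replan cycle, including the cap on LLM calls, could then be sketched as below. The `plan_tasks` and `needs_replanning` functions are hypothetical stubs standing in for the Planner LLM call and for its decision to replan; they are not part of any specific library.

```python
MAX_PLANNER_CALLS = 5  # safeguard against infinite replanning loops

def plan_tasks(question: str, feedback: dict | None = None) -> list:
    # Stand-in for the Planner LLM call: here it just returns the static
    # toy plan defined earlier; a real Planner would generate a new DAG
    # from the question (and from the feedback when replanning).
    return plan

def needs_replanning(results: dict) -> bool:
    # Stand-in for the Planner's decision to replan; a real system would
    # ask the LLM whether the intermediate results answer the question.
    return False

async def answer(question: str) -> str:
    """Simplified plan -> execute -> (re)plan loop."""
    current_plan = plan_tasks(question)
    for _ in range(MAX_PLANNER_CALLS):
        results = await run_plan(current_plan)
        if not needs_replanning(results):
            return results[current_plan[-1].idx]   # output of the final task
        # Feed intermediate outcomes back to the Planner to get a new DAG.
        current_plan = plan_tasks(question, feedback=results)
    return "Could not produce an answer within the call budget."

# Running this together with the earlier sketches:
# print(asyncio.run(answer("What is the capital of France? And the capital of Germany?")))
```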
Tests and Benchmark
As already discussed in a separate post here, there are many datasets that can be used to evaluate advanced and complex LLM and Agent architectures. For both single-hop and multi-hop question types, such datasets can be used to evaluate accuracy as well as latency and other aspects. For a thorough description of the tests performed with LLMCompiler, check out Section 5 of the original paper. The authors split the performance evaluation into three different types of question sets: 1) embarrassingly parallel function calling, 2) parallel function calling with dependencies, and 3) parallel function calling with replanning (a fourth example related to interactive decision making is also provided in the original paper, although its setup is a little different from the three previous cases).
The results for the third case, parallel function calling with replanning, are the most interesting ones. The Game of 24 was used in this example: the task is to obtain the number 24 from a set of four numbers using basic arithmetic operations (for example, given 4, 9, 10, and 13, one valid solution is (13 − 9) × (10 − 4) = 24). The LLMCompiler approach was compared to the original Tree-of-Thoughts (hereafter ToT) approach used for the same task. The increase in accuracy was not significant (1–2 percent for the GPT and Llama-2 models), but the latency speedup was almost 300% for GPT models and almost 210% for Llama-2 70B when using LLMCompiler compared to ToT. This is because most of the tasks can be executed in parallel, allowing a larger number of arithmetic operations to be evaluated at the same time.
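To get a feel for why this task parallelizes so well, the toy enumeration below lists the branches that open up from a single set of four numbers; every branch is independent, so a planner can schedule them as concurrent tasks. This is only an illustration of the search space, not the paper's ToT or LLMCompiler implementation.

```python
from itertools import combinations

def branches(numbers: tuple[int, ...]) -> list[tuple[float, ...]]:
    """All states reachable by combining two of the numbers with one basic
    arithmetic operation; each branch can be explored independently."""
    out = []
    for (i, a), (j, b) in combinations(enumerate(numbers), 2):
        rest = tuple(n for k, n in enumerate(numbers) if k not in (i, j))
        candidates = {a + b, a - b, b - a, a * b}
        if b != 0:
            candidates.add(a / b)
        if a != 0:
            candidates.add(b / a)
        out.extend(rest + (value,) for value in candidates)
    return out

# For 4, 9, 10, 13 there are dozens of independent branches to explore,
# one of which leads to (13 - 9) * (10 - 4) = 24.
print(len(branches((4, 9, 10, 13))))
```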
Conclusion
The LLMCompiler approach was presented based on the original paper by Kim et al. (2024), and we covered some of its main findings, especially for complex tasks that require replanning along the way, like the Game of 24. Although the gain in accuracy is not significant compared to other sophisticated approaches (like sequential ReAct or raw GPT models with function calling), the most impressive results are the reductions in latency, with speedups that, according to Table 1 of the original paper, range from about 40% to almost 300% in certain cases. LLMCompiler is a great fit for situations where one has many tools and tasks that can be executed in parallel!