Evaluation Datasets for LLMs — An overview

Gabriel Gomes, PhD
Jul 22, 2024


In the field of Generative AI, especially for problems related to question answering (QA) and information retrieval, Retrieval-Augmented Generation (RAG) has been used extensively to build applications that retrieve data from different sources, aggregate it, and produce a final answer for the user. For instance, a RAG application can be extremely useful for customer interaction on an e-commerce platform. Users could ask a chatbot questions like, “What are the best video game consoles for sale that cost below $500 and include at least two controllers and two games?” A good response would be a natural-language answer containing basic specifications of about 2–3 options, along with a link or reference so that users can access and purchase the product.
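To make the idea concrete, here is a minimal sketch of such a pipeline in Python. The `vector_store` and `llm` objects and their methods are hypothetical placeholders for whatever embedding index and LLM client an application actually uses; this is only an illustration of the retrieve-augment-generate flow, not a specific framework.

```python
# Minimal RAG sketch: retrieve top-k product descriptions, then ask an LLM
# to compose the final answer. `vector_store` and `llm` are hypothetical
# placeholders for a real index and LLM client.

def answer_question(question: str, vector_store, llm, k: int = 3) -> str:
    # 1. Retrieve: find the k documents most similar to the question.
    docs = vector_store.search(query=question, top_k=k)

    # 2. Augment: pack the retrieved text into the prompt as context.
    context = "\n\n".join(d.text for d in docs)
    prompt = (
        "Answer the question using only the context below. "
        "Include a link to each product you mention.\n\n"
        f"Context:\n{context}\n\nQuestion: {question}\nAnswer:"
    )

    # 3. Generate: the LLM produces the final natural-language answer.
    return llm.generate(prompt)
```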

When discussing RAG and QA applications, questions fall into at least two categories. Single-hop questions are those whose answer can be obtained directly, like the example in the previous paragraph. Multi-hop questions, however, require a sequence of steps (also known as reasoning steps) to reach a final result. An example of a multi-hop question would be: “Which scientist won a Nobel Prize in Physics after graduating from the University of Cambridge, and what was their contribution?” The sequence of steps for obtaining the answer would be:

  • Identify scientists who graduated from the University of Cambridge.
  • Determine which of these scientists won a Nobel Prize for Physics.
  • Find the specific contribution for which the Nobel Prize was awarded.

It is worth noting that, for complex questions requiring multiple reasoning steps, the reasoning path can also change dynamically as partial results are obtained; a sketch of this iterative retrieve-then-reason loop is shown below.
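The `retrieve` and `llm` helpers in the sketch are hypothetical placeholders meant only to show the control flow for the Cambridge example: each hop's partial answer is fed into the next query.

```python
# Multi-hop QA by iterative retrieval: each hop's answer feeds the next query.
# `retrieve` and `llm` are hypothetical placeholders, not a real API.

def multi_hop_answer(llm, retrieve) -> str:
    # Hop 1: identify scientists who graduated from the University of Cambridge.
    ctx1 = retrieve("scientists who graduated from the University of Cambridge")
    candidates = llm.generate(f"List the scientists mentioned here:\n{ctx1}")

    # Hop 2: determine which of them won a Nobel Prize in Physics.
    ctx2 = retrieve(f"Nobel Prize in Physics laureates among: {candidates}")
    laureate = llm.generate(f"Which of these won the Physics Nobel?\n{ctx2}")

    # Hop 3: find the specific contribution the prize was awarded for.
    ctx3 = retrieve(f"{laureate} Nobel Prize in Physics contribution")
    return llm.generate(f"Summarize {laureate}'s awarded contribution:\n{ctx3}")
```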

In this post, we present some of the most widely used datasets for evaluating the performance of RAG architectures, covering both single-hop and multi-hop QA.

Single-hop Questions Datasets

The Natural Questions dataset is a question-answering dataset consisting of real, anonymized, aggregated queries issued to the Google search engine. Each question is paired with a Wikipedia page from the top 5 search results. Annotators review the page and mark a long answer (typically a paragraph) and a short answer (one or more entities) if present, or mark null if no long/short answer is found. The dataset includes 307,373 training examples with single annotations, 7,830 development examples with 5-way annotations, and an additional 7,842 test examples with 5-way annotations. Extensive experiments validate the quality of the data, and an analysis of 25-way annotations on 302 examples provides insight into human variability in the annotation task. The dataset also introduces robust metrics for evaluating question-answering systems, demonstrates high human upper bounds on those metrics, and establishes baseline results using competitive methods from the literature.
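For a quick look at the data, the corpus is available through the Hugging Face `datasets` library. The hub id `natural_questions`, the split name, and the field names are assumptions worth verifying against the dataset card; since the corpus is large, streaming avoids a full download.

```python
# Peek at one Natural Questions example without downloading the full corpus.
from datasets import load_dataset

nq = load_dataset("natural_questions", split="validation", streaming=True)
example = next(iter(nq))
print(example.keys())  # inspect the schema (field names vary by dataset version)
```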

TriviaQA is a challenging reading comprehension dataset containing over 650,000 question-answer-evidence triples. It includes 95,000 question-answer pairs created by trivia enthusiasts, together with independently gathered evidence documents (six per question on average) that provide high-quality distant supervision for answering the questions. Compared to other large-scale datasets introduced around the same time, TriviaQA features relatively complex, compositional questions, significant syntactic and lexical variability between questions and the corresponding answer-evidence sentences, and requires more cross-sentence reasoning to find answers. An example of a question from the dataset is: “American Callan Pinckney’s eponymously named system became a best-selling (1980s-2000s) book/video franchise in what genre?”
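Because TriviaQA answers come with multiple accepted aliases, evaluation typically normalizes strings and checks a prediction against every alias. Below is a sketch of that style of matching; the aliases in the example are illustrative, not taken from the dataset.

```python
import re
import string

def normalize(text: str) -> str:
    """Lowercase, drop punctuation and articles, collapse whitespace."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in string.punctuation)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())

def trivia_match(prediction: str, gold_aliases: list[str]) -> bool:
    """A prediction counts as correct if it matches any accepted alias."""
    pred = normalize(prediction)
    return any(pred == normalize(alias) for alias in gold_aliases)

# Illustrative aliases for the Callanetics question quoted above.
print(trivia_match("Fitness", ["fitness", "physical fitness", "exercise"]))  # True
```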

SQuAD 2.0 was created to address a limitation of earlier QA datasets: they either focus only on answerable questions or generate unanswerable questions automatically, which makes those questions easy to spot. SQuAD 2.0 combines the existing SQuAD data with over 50,000 unanswerable questions written adversarially by crowdworkers to look similar to answerable ones, so systems must not only answer questions when possible but also recognize when no answer is supported by the paragraph and abstain.
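The practical consequence is that a SQuAD 2.0 metric must reward abstention on unanswerable questions. A minimal sketch of an exact-match score with that behavior could look like this (the official script also strips punctuation and articles; the normalization here is deliberately simple, and the gold labels are illustrative):

```python
def squad2_exact_match(prediction: str, gold_answers: list[str]) -> float:
    """Exact match with abstention: an empty gold list marks an unanswerable
    question, and the system scores 1.0 only if it predicts the empty string."""
    norm = lambda s: " ".join(s.lower().split())
    if not gold_answers:                       # unanswerable question
        return float(prediction.strip() == "")
    return float(any(norm(prediction) == norm(g) for g in gold_answers))

print(squad2_exact_match("Denver Broncos", ["Denver Broncos"]))  # 1.0 (answerable)
print(squad2_exact_match("some guess", []))                      # 0.0 (should abstain)
```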

Multi-hop Questions Datasets

HotpotQA is a dataset created to test the performance of language models on complex reasoning. It contains 113k Wikipedia-based question-answer pairs with four key features: (1) the questions require finding and reasoning over multiple supporting documents to answer; (2) the questions are diverse and not constrained to any pre-existing knowledge bases or knowledge schemas; (3) the creators provide sentence-level supporting facts required for reasoning, allowing QA systems to reason with strong supervision and explain their predictions; (4) a new type of factoid comparison question tests QA systems’ ability to extract relevant facts and perform the necessary comparison. When the dataset was released in 2018, it was hard enough that the most advanced models of the time failed in several cases; only when given the supporting facts could they perform reasonably well.
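Because HotpotQA provides sentence-level supporting facts, its evaluation scores not only the answer but also the predicted supporting sentences, identified as (article title, sentence index) pairs. Below is a small sketch of that supporting-fact F1; the titles come from a well-known HotpotQA comparison question, but the sentence indices are illustrative.

```python
def supporting_fact_f1(pred: set[tuple[str, int]], gold: set[tuple[str, int]]) -> float:
    """F1 over predicted vs. gold (title, sentence index) supporting facts."""
    if not pred or not gold:
        return 0.0
    tp = len(pred & gold)                 # correctly predicted supporting facts
    if tp == 0:
        return 0.0
    precision = tp / len(pred)
    recall = tp / len(gold)
    return 2 * precision * recall / (precision + recall)

gold = {("Scott Derrickson", 0), ("Ed Wood", 0)}
pred = {("Scott Derrickson", 0), ("Ed Wood", 2)}
print(supporting_fact_f1(pred, gold))     # 0.5
```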

MuSiQue is a dataset created with the following question in mind: “Can we create a question answering (QA) dataset that, by construction, requires proper multihop reasoning?” The authors used a bottom-up approach, systematically selecting connected single-hop questions and composing them into a single multi-hop question whose answer requires solving k (k > 1) single-hop questions, with each hop’s answer serving as context for the next.
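The structural idea behind this composition can be shown with a toy helper that chains two single-hop questions through their shared bridge entity; the actual MuSiQue pipeline additionally rephrases composed questions and filters out those answerable via shortcuts. The question strings below are illustrative, not drawn from the dataset.

```python
def compose_two_hop(q1: str, a1: str, q2: str) -> str:
    """Chain q1 -> q2 by replacing q2's mention of a1 (the bridge entity)
    with a reference to q1. This shows only the structural idea."""
    assert a1 in q2, "the bridge entity must appear in the second question"
    return q2.replace(a1, f"[the answer to: {q1}]")

composed = compose_two_hop(
    q1="Who wrote 'A Brief History of Time'?",
    a1="Stephen Hawking",
    q2="Which university awarded Stephen Hawking his PhD?",
)
print(composed)
# Which university awarded [the answer to: Who wrote 'A Brief History of Time'?] his PhD?
```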

Game of 24 is a mathematical reasoning challenge in which the goal is to use 4 numbers and basic arithmetic operations (+, -, *, /) to obtain 24. For example, given the input “4 9 10 13”, a valid solution is “(10 - 4) * (13 - 9) = 24”. The task was popularized as an LLM reasoning benchmark by Tree of Thoughts, and several advanced reasoning and RAG approaches, such as the LLMCompiler strategy, use partial initial solutions to this task to compute the remaining numbers during chain execution: for instance, one can combine 2 of the numbers first and then search for a combination of the result with the remaining numbers that satisfies the condition.
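Since the task is fully combinatorial, a brute-force solver makes the search space concrete: repeatedly pick two numbers, combine them with one of the four operations, and recurse on the shorter list, which covers every parenthesization.

```python
def solve24(nums: list[float], target: float = 24.0) -> bool:
    """Return True if the numbers can be combined with +, -, *, / to hit target."""
    if len(nums) == 1:
        return abs(nums[0] - target) < 1e-6
    # Pick any ordered pair, combine it, and recurse on the remaining numbers.
    for i in range(len(nums)):
        for j in range(len(nums)):
            if i == j:
                continue
            rest = [nums[k] for k in range(len(nums)) if k not in (i, j)]
            a, b = nums[i], nums[j]
            candidates = [a + b, a - b, a * b]
            if abs(b) > 1e-6:                  # avoid division by zero
                candidates.append(a / b)
            if any(solve24(rest + [c], target) for c in candidates):
                return True
    return False

print(solve24([4, 9, 10, 13]))  # True: (10 - 4) * (13 - 9) = 24
```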

References

Natural Questions: A Benchmark for Question Answering Research

TriviaQA: A Large Scale Distantly Supervised Challenge Dataset for Reading Comprehension

Know What You Don’t Know: Unanswerable Questions for SQuAD

HotpotQA: A Dataset for Diverse, Explainable Multi-hop Question Answering

MuSiQue: Multihop Questions via Single-hop Question Composition

Tree of Thoughts: Deliberate Problem Solving with Large Language Models
