Evaluating the Quality of a RAG Application

So, you’ve built a question-and-answer application using RAG (retrieval-augmented generation). How good are its answers? Let’s get into that in this article.

Three Metrics of RAG Quality

Three metrics are popularly used to measure the quality of a RAG application.

  • Answer Relevance – How well does the answer address the question? This is perhaps the most important metric.
  • Context Relevance – How relevant is the retrieved context text to the question? This measures the quality of the retrieval process. We look at this metric when trying to improve retrieval with a different distance function or chunking strategy.
  • Groundedness – How well is the answer produced by the LLM supported by facts present in the context? A low score indicates hallucination.

These metrics are usually calculated on a scale of 0 to 10.
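The three metrics above are often called the "RAG triad." A minimal sketch of how you might hold them together in code follows; the `RagScores` class, the 0–10 scale as floats, and the hallucination threshold of 4 are illustrative assumptions, not a standard.

```python
from dataclasses import dataclass

@dataclass
class RagScores:
    answer_relevance: float   # 0-10: how well the answer addresses the question
    context_relevance: float  # 0-10: how well the retrieved context fits the question
    groundedness: float       # 0-10: how well the answer is supported by the context

    def likely_hallucination(self, threshold: float = 4.0) -> bool:
        # A low groundedness score suggests the LLM stated facts
        # that are not present in the retrieved context.
        # The threshold here is an arbitrary illustrative choice.
        return self.groundedness < threshold

scores = RagScores(answer_relevance=8, context_relevance=7, groundedness=2)
```

With this example, `scores.likely_hallucination()` returns `True`: the answer reads well and the context was on topic, but the answer is not backed by that context.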

Ways to Calculate These Metrics

There are several ways you can measure these metrics.

Human Evaluation – Human users, preferably domain experts, can manually evaluate the results and assign scores. This process is time-consuming and expensive.

Traditional NLP Metrics – We can compute traditional metrics like BLEU and ROUGE scores. Unfortunately, these measure surface-level word overlap against a reference text, so they correlate poorly with the qualities we care about and may not do a good job of evaluating your RAG application.

Using a Language Model – Most intriguingly, we can ask a language model to produce these scores using a few cleverly designed prompts. For cheap and fast evaluation you can use a small model like BERT. For a better evaluation you can use an LLM like ChatGPT, Gemini, or Llama. Evidence suggests that LLMs can match domain experts in evaluating these scores.
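The LLM-as-judge pattern boils down to two steps: build a grading prompt, then parse a number out of the model's reply. Here is a hedged sketch of both steps; the prompt wording and the helper names are assumptions for illustration, and the actual call to your model client is left out.

```python
def build_grader_prompt(metric: str, question: str, text: str) -> str:
    # Ask the model to reply with nothing but a 0-10 score.
    # This wording is illustrative; real grading prompts (like the
    # TruLens one shown later) add detailed scoring guidelines.
    return (
        f"You are a {metric} grader.\n"
        f"Respond only with a number from 0 to 10.\n\n"
        f"QUESTION: {question}\n\n"
        f"TEXT: {text}\n\n"
        f"SCORE:"
    )

def parse_score(reply: str) -> int:
    # Models sometimes pad the number with whitespace or punctuation,
    # so strip everything that is not a digit before converting.
    digits = "".join(ch for ch in reply if ch.isdigit())
    score = int(digits) if digits else 0
    return max(0, min(10, score))  # clamp to the 0-10 scale
```

In practice you would send `build_grader_prompt(...)` to whichever chat model you use and run its reply through `parse_score`, so that a response like `" 9.\n"` becomes the integer `9`.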

Using an LLM to Calculate Metrics

We can use some interesting prompts to ask an LLM to calculate these metrics. TruLens has built a library around this idea, and you can view the prompts it uses in its source code. Here’s an example prompt to calculate the context relevance metric.

You are a RELEVANCE grader; providing the relevance of the given CONTEXT to the given QUESTION.
    Respond only as a number from 0 to 10 where 0 is the least relevant and 10 is the most relevant. 

    A few additional scoring guidelines:

    - Long CONTEXTS should score equally well as short CONTEXTS.

    - RELEVANCE score should increase as the CONTEXTS provides more RELEVANT context to the QUESTION.

    - RELEVANCE score should increase as the CONTEXTS provides RELEVANT context to more parts of the QUESTION.

    - CONTEXT that is RELEVANT to some of the QUESTION should score of 2, 3 or 4. Higher score indicates more RELEVANCE.

    - CONTEXT that is RELEVANT to most of the QUESTION should get a score of 5, 6, 7 or 8. Higher score indicates more RELEVANCE.

    - CONTEXT that is RELEVANT to the entire QUESTION should get a score of 9 or 10. Higher score indicates more RELEVANCE.

    - CONTEXT must be relevant and helpful for answering the entire QUESTION to get a score of 10.

    - Never elaborate.


QUESTION: When was Andrzej Tadeusz Bonaventura Kościuszko born?

CONTEXT: 

 “I had no idea who this was, but I just really like the national park system,” said Erin Sully, of Minneapolis, during her stop. 

If he had a highlight reel, it would go like this: Born Andrzej Tadeusz Bonaventura Kościuszko in 1746, he studied engineering and military strategy. He sailed to America in 1776 after an ill-fated bid to marry a noblewoman. In Philadelphia, he impressed Franklin and became a colonel in the Continental Army.

RELEVANCE:

Gemini will respond to this with a score of 9.

Try replacing the context with something obviously irrelevant to the question.

Back in Philly, he crashed at 3rd and Pine—now the memorial site—for five months starting 
in late 1797. He grew close to then-Vice President Jefferson, who dubbed him “as pure a son of 
liberty as I have ever known.” Congress belatedly paid Kosciuszko for his war service. With 
Jefferson’s aid, he wrote a will calling for his estate to help free slaves—a wish never granted. 
Back across the ocean he went, dying 19 years later in Switzerland at 71.

You will get a score of 3.
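This gap between 9 and 3 is exactly what makes context relevance useful for tuning retrieval: you can grade several candidate chunks and keep the best one. The sketch below stubs the grading step with fixed scores purely so the ranking logic is visible; in practice `score_context` would be the prompt-plus-LLM round trip described above.

```python
def best_context(question, contexts, score_context):
    # Grade each candidate chunk and return the most relevant one
    # along with its score.
    scored = [(score_context(question, c), c) for c in contexts]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored[0]

# Stubbed scores standing in for real LLM grading calls.
fake_scores = {"relevant chunk": 9, "irrelevant chunk": 3}
score, chunk = best_context(
    "When was Kosciuszko born?",
    ["irrelevant chunk", "relevant chunk"],
    lambda q, c: fake_scores[c],
)
```

Here `best_context` returns the chunk that scored 9, mirroring how you might compare chunking strategies or distance functions by the context relevance scores they produce.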

Conclusion

I find it most interesting that we can use LLMs to measure the quality of an LLM-based application. You can use a library like TruLens, or write your own evaluator by borrowing and modifying their prompts.
