RAG Evaluation with Prometheus 2


Evaluating the responses of Language Models and LLM-based applications often involves using model-based metrics that do not require ground truth labels. Large proprietary models like GPT-4 and Claude 3 Opus are frequently employed as evaluators and demonstrate a good correlation with human evaluations.

However, relying on closed models poses several challenges:

  • fairness: the training data of these models is unknown.
  • controllability: the behavior of these models can change unpredictably.
  • data privacy: sending data to external providers may raise privacy concerns.
  • affordability: using these powerful models can be expensive.

Using open models for evaluation is an active research area, but their practical use is often limited. They typically do not correlate well with human judgments and lack flexibility.

🔥 Prometheus 2 is a new family of open-source models designed to address these gaps:

  • two variants, fine-tuned from Mistral-7B and Mixtral-8x7B respectively
  • trained on open-source data
  • demonstrate high correlation with human evaluations and proprietary models
  • highly flexible: capable of performing direct assessments and pairwise rankings, and allowing the definition of custom evaluation criteria.

In this experimental notebook, we will use Prometheus 2 to evaluate the responses of a RAG pipeline.

First, we will build the RAG pipeline and collect some results. Then, we will code a custom Prometheus Evaluator component for Haystack. Finally, we will initialize three different evaluators and run them in an evaluation pipeline.

Create the RAG pipeline to evaluate

We want to use Prometheus 2 to evaluate the answers generated by a RAG pipeline, so we first need to build that pipeline.

This part is quite similar to the “Evaluating RAG Pipelines” tutorial. Take a look at it for more details.

If you prefer, you can read this section without running it: the generated data is provided for the later evaluation steps.

!pip install haystack-ai datasets sentence-transformers accelerate huggingface_hub bitsandbytes

We will be using a labeled PubMed dataset with questions, contexts and answers. This allows us to use the contexts as Documents and provides the necessary labeled data for some of the evaluation metrics we will define.

In this example, we will use the first 100 rows.

First, let’s fetch the dataset and extract all_documents, all_questions and all_ground_truth_answers.

from datasets import load_dataset
from haystack import Document

dataset = load_dataset("vblagoje/PubMedQA_instruction", split="train")
dataset = dataset.select(range(100))
all_documents = [Document(content=doc["context"]) for doc in dataset]
all_questions = [doc["instruction"] for doc in dataset]
all_ground_truth_answers = [doc["response"] for doc in dataset]

Indexing pipeline

Next, let’s build a simple indexing pipeline and write the documents into a Document Store.

from haystack import Pipeline
from haystack.components.embedders import SentenceTransformersDocumentEmbedder
from haystack.components.writers import DocumentWriter
from haystack.document_stores.in_memory import InMemoryDocumentStore
from haystack.document_stores.types import DuplicatePolicy

document_store = InMemoryDocumentStore()

document_embedder = SentenceTransformersDocumentEmbedder(model="sentence-transformers/all-MiniLM-L6-v2")
document_writer = DocumentWriter(document_store=document_store, policy=DuplicatePolicy.SKIP)

indexing = Pipeline()
indexing.add_component(instance=document_embedder, name="document_embedder")
indexing.add_component(instance=document_writer, name="document_writer")

indexing.connect("document_embedder.documents", "document_writer.documents")

indexing.run({"document_embedder": {"documents": all_documents}})

{'document_writer': {'documents_written': 100}}

RAG pipeline

Now that we have our data ready, we can create a simple RAG pipeline.

In this example, we’ll be using:

  • InMemoryEmbeddingRetriever to retrieve the relevant documents for the query.
  • HuggingFaceLocalGenerator with google/gemma-1.1-2b-it to generate answers to queries. It is a small model, and later we will evaluate the quality of the generated responses based on custom criteria.

import os
from getpass import getpass
from haystack import Pipeline
from haystack.components.builders import AnswerBuilder, PromptBuilder
from haystack.components.embedders import SentenceTransformersTextEmbedder
from haystack.components.retrievers.in_memory import InMemoryEmbeddingRetriever
from haystack.components.generators import HuggingFaceLocalGenerator
from haystack.utils import ComponentDevice


# to access Gemma
# 1. you need a Hugging Face account
# 2. you have to accept Google conditions: https://huggingface.co/google/gemma-1.1-2b-it
# 3. copy your HF token (https://huggingface.co/settings/tokens) and paste it below
os.environ["HF_API_TOKEN"] = getpass("Your Hugging Face token")

generator = HuggingFaceLocalGenerator(
    "google/gemma-1.1-2b-it",
    huggingface_pipeline_kwargs={"device_map": "auto"},
    device=ComponentDevice.from_str("cuda:0"),
)

template = """
<bos><start_of_turn>user
You have to answer the following question based on the given context information only.

Context:
{% for document in documents %}
    {{ document.content }}
{% endfor %}

Question: {{question}}
Answer:<end_of_turn>
<start_of_turn>model"""

rag_pipeline = Pipeline()
rag_pipeline.add_component(
    "query_embedder",
    SentenceTransformersTextEmbedder(model="sentence-transformers/all-MiniLM-L6-v2"),
)
rag_pipeline.add_component("retriever", InMemoryEmbeddingRetriever(document_store, top_k=3))
rag_pipeline.add_component("prompt_builder", PromptBuilder(template=template))
rag_pipeline.add_component("generator", generator)
rag_pipeline.add_component("answer_builder", AnswerBuilder())

rag_pipeline.connect("query_embedder", "retriever.query_embedding")
rag_pipeline.connect("retriever", "prompt_builder.documents")
rag_pipeline.connect("prompt_builder", "generator")
rag_pipeline.connect("generator.replies", "answer_builder.replies")
rag_pipeline.connect("retriever", "answer_builder.documents")

<haystack.core.pipeline.pipeline.Pipeline object at 0x7b1b5c30bdc0>
🚅 Components
  - query_embedder: SentenceTransformersTextEmbedder
  - retriever: InMemoryEmbeddingRetriever
  - prompt_builder: PromptBuilder
  - generator: HuggingFaceLocalGenerator
  - answer_builder: AnswerBuilder
🛤️ Connections
  - query_embedder.embedding -> retriever.query_embedding (List[float])
  - retriever.documents -> prompt_builder.documents (List[Document])
  - retriever.documents -> answer_builder.documents (List[Document])
  - prompt_builder.prompt -> generator.prompt (str)
  - generator.replies -> answer_builder.replies (List[str])

You can try the RAG pipeline by asking a question:

question = "Do high levels of procalcitonin in the early phase after pediatric liver transplantation indicate poor postoperative outcome?"

response = rag_pipeline.run(
    {
        "query_embedder": {"text": question},
        "prompt_builder": {"question": question},
        "answer_builder": {"query": question},
    }
)
print(response["answer_builder"]["answers"][0].data)

 **Yes.**

The study found that high levels of procalcitonin (PCT) on postoperative day (POD) 2 were associated with worse outcomes, including higher International Normalized Ratio values, primary graft non-function, longer stay in the pediatric intensive care unit, and mechanical ventilation.
response["answer_builder"]
{'answers': [GeneratedAnswer(data=' **Yes.**\n\nThe study found that high levels of procalcitonin (PCT) on postoperative day (POD) 2 were associated with worse outcomes, including higher International Normalized Ratio values, primary graft non-function, longer stay in the pediatric intensive care unit, and mechanical ventilation.', query='Do high levels of procalcitonin in the early phase after pediatric liver transplantation indicate poor postoperative outcome?', documents=[Document(id=9928bb3fd5bfd294a30717df6f590301c0f7c82f65fec5ff9ae7a00ac4956571, content: 'To date, no data is available about procalcitonin (PCT) levels and its relevance to morbidity and gr...', score: 0.7619273394960041), Document(id=2f1be411b8673646b72551e57af84872e39a788c3602c9b22af2ae901eda0da4, content: 'Intrahepatic cholestasis of pregnancy (ICP) is defined by pruritus, elevated total fasting serum bil...', score: 0.4159278001751194), Document(id=b112787486a85ff8086de3f2562d80497bc4cc76bc9d8cf9d3d5b3ee3b663975, content: 'Most hepatocellular carcinomas (HCCs) are associated with cirrhosis. Portal hypertension (PHT) and e...', score: 0.34273266043157447)], meta={})]}

Run the RAG pipeline and save results

Let’s run our RAG pipeline with a set of questions, and make sure to save the data we need for evaluation: questions, ground truth answers and generated answers.

  • In this example, we will use 10 random questions.
  • In the evaluation part, we will not evaluate the retrieved context, so we will not save it. However, you can choose to consider context in the evaluation: as we will see later, evaluation with Prometheus is very customizable.

import random

questions, ground_truth_answers = zip(*random.sample(list(zip(all_questions, all_ground_truth_answers)), 10))
rag_answers = []

for question in list(questions):
    results = rag_pipeline.run(
        {
            "query_embedder": {"text": question},
            "prompt_builder": {"question": question},
            "answer_builder": {"query": question},
        }
    )

    rag_answers.append(results["answer_builder"]["answers"][0].data)
results = {
    "questions": questions,
    "ground_truth_answers": ground_truth_answers,
    "rag_answers": rag_answers,
}
import json

with open("gemma_2b_rag_results.json", "w") as fo:
    json.dump(results, fo)
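
Before moving on, it can be worth checking that the saved file has the shape the evaluation steps expect. The following is a minimal sketch, not part of the original notebook: the helper name `validate_rag_results` is hypothetical and the sample values are placeholders rather than real pipeline output. It also shows that the tuples produced by `zip` come back as lists after a JSON round trip.

```python
import json

REQUIRED_KEYS = ("questions", "ground_truth_answers", "rag_answers")


def validate_rag_results(results: dict) -> int:
    """Check that all required lists are present and equally long; return their length."""
    missing = [key for key in REQUIRED_KEYS if key not in results]
    if missing:
        raise KeyError(f"missing keys: {missing}")
    lengths = {key: len(results[key]) for key in REQUIRED_KEYS}
    if len(set(lengths.values())) != 1:
        raise ValueError(f"lists have different lengths: {lengths}")
    return lengths[REQUIRED_KEYS[0]]


# Placeholder values (hypothetical), shaped like the dictionary saved above.
sample = {
    "questions": ("question 1", "question 2"),
    "ground_truth_answers": ("reference 1", "reference 2"),
    "rag_answers": ["answer 1", "answer 2"],
}

# json.dumps/json.loads stand in for writing and re-reading the file.
reloaded = json.loads(json.dumps(sample))
n_examples = validate_rag_results(reloaded)
print(n_examples, type(reloaded["questions"]).__name__)
```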

Evaluation with Prometheus 2

After the preparation work, we can use Prometheus 2 to evaluate the generated responses along several axes.

This model expects a prompt like the one below and returns a text containing feedback and a score.

###Task Description:
An instruction (might include an Input inside it), a response to evaluate, a reference answer that gets a score of 5, and a score rubric representing a evaluation criteria are given.
1. Write a detailed feedback that assess the quality of the response strictly based on the given score rubric, not evaluating in general.
2. After writing a feedback, write a score that is an integer between 1 and 5. You should refer to the score rubric.
3. The output format should look as follows: \"Feedback: (write a feedback for criteria) [RESULT] (an integer number between 1 and 5)\"
4. Please do not generate any other opening, closing, and explanations.

###The instruction to evaluate:
{orig_instruction}

###Response to evaluate:
{orig_response}

###Reference Answer (Score 5):
{orig_reference_answer}

###Score Rubrics:
[{orig_criteria}]
Score 1: {orig_score1_description}
Score 2: {orig_score2_description}
Score 3: {orig_score3_description}
Score 4: {orig_score4_description}
Score 5: {orig_score5_description}

###Feedback:
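
To make the placeholders concrete, here is a minimal sketch of how the variable part of this prompt could be filled with plain `str.format`. The function name `fill_prompt` and the sample values are hypothetical, and the Task Description section is omitted for brevity; the evaluators later in this notebook use Jinja templates through Haystack's PromptBuilder instead.

```python
from typing import List

# Variable part of the prompt shown above (Task Description omitted for brevity).
PROMPT_BODY = """###The instruction to evaluate:
{orig_instruction}

###Response to evaluate:
{orig_response}

###Reference Answer (Score 5):
{orig_reference_answer}

###Score Rubrics:
[{orig_criteria}]
Score 1: {orig_score1_description}
Score 2: {orig_score2_description}
Score 3: {orig_score3_description}
Score 4: {orig_score4_description}
Score 5: {orig_score5_description}

###Feedback:"""


def fill_prompt(
    instruction: str,
    response: str,
    reference: str,
    criteria: str,
    score_descriptions: List[str],
) -> str:
    """Fill the placeholders; exactly five score descriptions are required."""
    if len(score_descriptions) != 5:
        raise ValueError("expected one description for each score from 1 to 5")
    fields = {
        "orig_instruction": instruction,
        "orig_response": response,
        "orig_reference_answer": reference,
        "orig_criteria": criteria,
    }
    for score, description in enumerate(score_descriptions, start=1):
        fields[f"orig_score{score}_description"] = description
    return PROMPT_BODY.format(**fields)


# Invented example values, for illustration only.
prompt = fill_prompt(
    instruction="Answer the question: what is 2 + 2?",
    response="4",
    reference="2 + 2 equals 4.",
    criteria="Is the answer correct?",
    score_descriptions=["wrong", "mostly wrong", "partly right", "mostly right", "right"],
)
print(prompt)
```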

Create a Prometheus Evaluator component

To perform evaluation, we create a custom Haystack Evaluator component. In Haystack, it is easy to create custom components, and we can implement Prometheus Evaluator with just a few lines of code.

Design choices

Our implementation is hacky and aimed at experimentation, but some choices are worth explaining.

  • the component is inspired by and extends our LLMEvaluator, with specific adaptations for Prometheus

  • init parameters

    • template: Prometheus is highly customizable, so we can easily create different evaluators with different prompt templates
    • inputs: The inputs that the evaluator expects and that it evaluates. They should match those defined in the template.
    • generator: (hacky) allows passing different types of Haystack generators to use the Prometheus model. Examples: HuggingFaceLocalGenerator, LlamaCppGenerator, etc.
  • run method: for each example to evaluate, the inputs are integrated into the prompt and passed to the model; then the model output is parsed to extract score and feedback. This method returns a dictionary containing an aggregate score, individual_scores and feedbacks.
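
The parsing step mentioned in the last point can be sketched in isolation. This is a simplified, standalone version under our own assumptions: the function name `parse_prometheus_output` is hypothetical, and unlike the component's own parser it does not strip an echoed prompt from the model output.

```python
from typing import Optional, Tuple

VALID_SCORES = {"1", "2", "3", "4", "5"}


def parse_prometheus_output(output: str) -> Tuple[str, Optional[int]]:
    """Split a reply of the form '<feedback> [RESULT] <score>' into its two parts.

    Returns (feedback, score); score is None when the marker is missing
    or the text after it is not an integer between 1 and 5.
    """
    feedback, marker, score_str = output.rpartition("[RESULT]")
    if not marker:
        # No marker found: treat the whole reply as feedback without a score.
        return output.strip(), None
    score_str = score_str.strip()
    score = int(score_str) if score_str in VALID_SCORES else None
    return feedback.strip(), score


print(parse_prometheus_output("Feedback: Clear and on topic. [RESULT] 4"))
print(parse_prometheus_output("The model forgot the marker"))
```

Returning `None` instead of raising keeps a single malformed reply from interrupting a long evaluation run; the caller decides how to treat missing scores.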

from typing import Any, Dict, List, Tuple, Type
from haystack import component
from haystack.components.evaluators import LLMEvaluator
from haystack.components.builders import PromptBuilder
from tqdm import tqdm
from numpy import mean as np_mean


ABS_SYSTEM_PROMPT = (
    "You are a fair judge assistant tasked with providing clear, objective feedback based on "
    "specific criteria, ensuring each assessment reflects the absolute standards set for performance."
)


@component
class PrometheusLLMEvaluator(LLMEvaluator):
    def __init__(
        self,
        generator,
        template: str,
        inputs: List[Tuple[str, Type[List]]],
        progress_bar: bool = True,
    ):
        outputs = ["feedback", "score"]
        self.validate_init_parameters(inputs, outputs, [])
        self.inputs = inputs
        self.outputs = outputs

        self._builder = PromptBuilder(template=template)
        self._generator = generator
        self.progress_bar = progress_bar

        component.set_input_types(self, **dict(inputs))

    def _parse_output(self, output):
        feedback, _, score_str = output.rpartition("[RESULT]")
        feedback = feedback.rpartition("###Feedback: [/INST]")[-1].strip()
        score_str = score_str.strip()

        score = None
        if score_str.isdigit() and score_str in ["1", "2", "3", "4", "5"]:
            score = int(score_str)
        return feedback, score

    @component.output_types(score=float, individual_scores=List[float], feedbacks=List[str])
    def run(self, **inputs) -> Dict[str, Any]:
        self.validate_input_parameters(dict(self.inputs), inputs)

        # inputs is a dictionary with keys being input names and values being a list of input values
        # We need to iterate through the lists in parallel for all keys of the dictionary
        input_names, values = inputs.keys(), list(zip(*inputs.values()))
        list_of_input_names_to_values = [dict(zip(input_names, v)) for v in values]

        individual_scores, feedbacks = [], []
        for input_names_to_values in tqdm(list_of_input_names_to_values, disable=not self.progress_bar):
            
            partial_prompt = self._builder.run(**input_names_to_values)["prompt"]
            prompt = f"[INST] {ABS_SYSTEM_PROMPT}\n{partial_prompt} [/INST]"
            
            output = self._generator.run(prompt=prompt)["replies"][0]

            feedback, individual_score = self._parse_output(output)
            if individual_score is not None:
                individual_scores.append(individual_score)
            feedbacks.append(feedback)
        score = np_mean(individual_scores)

        return {
            "score": score,
            "individual_scores": individual_scores,
            "feedbacks": feedbacks,
        }
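
The comment in the `run` method about iterating through the input lists in parallel may be easier to follow with a standalone example. The sketch below uses a hypothetical helper, `rows_from_columns`, and placeholder strings; it adds its own length check because `zip` silently truncates to the shortest list.

```python
from typing import Any, Dict, List


def rows_from_columns(columns: Dict[str, List[Any]]) -> List[Dict[str, Any]]:
    """Turn {"name": [v1, v2, ...], ...} into [{"name": v1, ...}, {"name": v2, ...}, ...]."""
    lengths = {len(values) for values in columns.values()}
    if len(lengths) > 1:
        raise ValueError("all input lists must have the same length")
    names = list(columns.keys())
    return [dict(zip(names, values)) for values in zip(*columns.values())]


# Placeholder inputs, shaped like the keyword arguments passed to run().
rows = rows_from_columns(
    {
        "query": ["question 1", "question 2"],
        "generated_answer": ["answer 1", "answer 2"],
    }
)
for row in rows:
    print(row)
```

Each resulting dictionary holds the values for one example, which is the form needed to render the prompt template once per example.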

Load the Prometheus 2 model

We are going to use prometheus-7b-v2.0: the smallest variant of Prometheus 2, which can run on a standard Colab notebook with 8-bit quantization.

In particular, we will use the model via HuggingFaceLocalGenerator, based on the Transformers library.

The generation_kwargs simply replicate those used in the prometheus-eval library. For practical applications, it would be worth experimenting and seeing if there is a better combination of parameters that provides good evaluation performance and reproducibility.
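
Since sampling is enabled, the same example can receive different scores on different runs. One possible way to gauge reproducibility is to score an example several times and look at the spread. The sketch below is only an illustration under stated assumptions: `score_spread` is a hypothetical helper, and the scoring function is a stand-in that replays fixed values instead of calling the model.

```python
from statistics import mean, pstdev
from typing import Callable, Dict, List, Optional


def score_spread(score_once: Callable[[], Optional[int]], n_runs: int = 5) -> Dict[str, object]:
    """Call score_once n_runs times and summarize the scores that could be parsed."""
    scores: List[Optional[int]] = [score_once() for _ in range(n_runs)]
    valid = [s for s in scores if s is not None]
    return {
        "scores": scores,
        "n_unparsed": len(scores) - len(valid),
        "mean": mean(valid) if valid else None,
        "stdev": pstdev(valid) if valid else None,
    }


# Stand-in for a real evaluator call: replays invented scores, None = unparsable reply.
_fake_scores = iter([4, 4, 5, None, 3])
summary = score_spread(lambda: next(_fake_scores), n_runs=5)
print(summary)
```

In a real experiment, `score_once` would wrap a single-example call to the evaluator, and the summary could be compared across different combinations of generation parameters.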

As mentioned earlier, there are several other options for running this open model with Haystack:

  • resource-constrained environments: LlamaCppGenerator (can run on CPU-only environments thanks to the GGUF quantized format; example commented below)
  • in production, with available GPU resources: TGI (via HuggingFaceAPIGenerator), vLLM.

# if you have previously run the RAG pipeline, you will probably need to restart
# the kernel in order to free up GPU memory

from haystack.components.generators import HuggingFaceLocalGenerator

generator = HuggingFaceLocalGenerator(
    model="prometheus-eval/prometheus-7b-v2.0",
    task="text2text-generation",
    huggingface_pipeline_kwargs={
        "device_map": "auto",
        "model_kwargs": {"load_in_8bit": True},
    },
    generation_kwargs={
        "max_new_tokens": 512,
        "temperature": 1.0,
        "do_sample": True,
        "repetition_penalty": 1.03,
        "top_p": 0.9,
    },
)

generator.warm_up()

# UNCOMMENT THE FOLLOWING LINES TO USE llama.cpp
# You can also choose a model with a different quantization: you will lose some quality in exchange for lower resource usage and faster inference

# ! pip install haystack-ai llama-cpp-haystack

# from haystack_integrations.components.generators.llama_cpp import LlamaCppGenerator
# from huggingface_hub import hf_hub_download

# prometheus_path = hf_hub_download(
#             repo_id="AlekseiPravdin/prometheus-7b-v2_0-gguf", filename="prometheus-7b-v2_0.q8_0.gguf", repo_type="model"
# )

# generator = LlamaCppGenerator(
#     model=prometheus_path,
#     n_ctx=8192,
#     n_batch=512,
#     generation_kwargs={"max_tokens": 512, "temperature": 1.0, "repeat_penalty": 1.03, "top_p": 0.9},
# )
# generator.warm_up()

Initialize different Prometheus Evaluators

We will define 3 prompt templates and corresponding Prometheus Evaluators:

  • Correctness: Evaluates the generated answer considering both relevance to the question and similarity to the ground truth answer.
  • Response Relevance: Evaluates the generated answer in terms of its relevance to the user’s question.
  • Logical Robustness: Evaluates the logical organization and progression of the response.

As shown, customizing the prompt template makes it possible to create a diverse range of evaluators.

In general, the first section (Task Description) should be left intact: within it, the only thing to change, as illustrated in the following examples, is whether or not a reference answer is mentioned.

⚠️ Although these evaluator names may be similar to evaluation metrics used in Haystack or other libraries, it is important to understand that they are created specifically for Prometheus and produce scores between 1 and 5. They are not comparable to conceptually similar but differently defined metrics.

correctness_prompt_template = """
###Task Description:
An instruction (might include an Input inside it), a response to evaluate, a reference answer that gets a score of 5, and a score rubric representing a evaluation criteria are given.
1. Write a detailed feedback that assesses the quality of the response strictly based on the given score rubric, not evaluating in general.
2. After writing a feedback, write a score that is an integer between 1 and 5. You should refer to the score rubric.
3. The output format should look as follows: "Feedback: (write a feedback for criteria) [RESULT] (1 or 2 or 3 or 4 or 5)"
4. Please do not generate any other opening, closing, and explanations.

###The instruction to evaluate:
Your task is to evaluate the generated answer against the reference answer for the question: {{query}}

###Response to evaluate:
generated answer: {{generated_answer}}

###Reference Answer (Score 5): {{reference_answer}}

###Score Rubrics:
Score 1: The answer is not relevant to the question and does not align with the reference answer.
Score 2: The answer is relevant to the question but deviates significantly from the reference answer.
Score 3: The answer is relevant to the question and generally aligns with the reference answer but has errors or omissions.
Score 4: The answer is relevant to the question and closely matches the reference answer but is less concise or clear.
Score 5: The answer is highly relevant, fully accurate, and matches the reference answer in both content and clarity.

###Feedback:""".strip()

correctness_evaluator = PrometheusLLMEvaluator(
    template=correctness_prompt_template,
    generator=generator,
    inputs=[
        ("query", List[str]),
        ("generated_answer", List[str]),
        ("reference_answer", List[str]),
    ],
)



response_relevance_prompt_template = """
###Task Description:
An instruction (might include an Input inside it), a response to evaluate, and a score rubric representing a evaluation criteria are given.
1. Write a detailed feedback that assess the quality of the response strictly based on the given score rubric, not evaluating in general.
2. After writing a feedback, write a score that is an integer between 1 and 5. You should refer to the score rubric.
3. The output format should look as follows: "Feedback: (write a feedback for criteria) [RESULT] (an integer number between 1 and 5)"
4. Please do not generate any other opening, closing, and explanations.

###The instruction to evaluate:
Your task is to evaluate whether the generated answer is relevant to the question: {{query}}

###Response to evaluate:
generated answer: {{generated_answer}}

###Score Rubrics:
Score 1: The generated answer is off-topic or irrelevant to the question asked.
Score 2: The generated answer includes some relevant information but often contains unrelated details.
Score 3: The generated answer is generally relevant to the question but occasionally includes extraneous or off-topic details.
Score 4: The generated answer is mostly relevant to the question, with minimal unrelated information.
Score 5: The generated answer is highly relevant to the question, addressing it directly and thoroughly without including unnecessary information.

###Feedback:""".strip()

response_relevance_evaluator = PrometheusLLMEvaluator(
    template=response_relevance_prompt_template,
    generator=generator,
    inputs=[("query", List[str]), ("generated_answer", List[str])],
)



logical_robustness_prompt_template = """
###Task Description:
An instruction (might include an Input inside it), a response to evaluate, and a score rubric representing a evaluation criteria are given.
1. Write a detailed feedback that assess the quality of the response strictly based on the given score rubric, not evaluating in general.
2. After writing a feedback, write a score that is an integer between 1 and 5. You should refer to the score rubric.
3. The output format should look as follows: "Feedback: (write a feedback for criteria) [RESULT] (an integer number between 1 and 5)"
4. Please do not generate any other opening, closing, and explanations.

###The instruction to evaluate:
Your task is to evaluate how logically the generated answer for the question is organized, ensuring a clear progression of ideas and arguments that are easy to follow. question:{{query}}

###Response to evaluate:
generated answer: {{generated_answer}}

###Score Rubrics:
Score 1: Disorganized, lacks clear structure, and is difficult to follow.
Score 2: Some structure, but inconsistent and hard to follow due to abrupt transitions.
Score 3: Generally organized with minor flow issues and occasional unclear connections.
Score 4: Well-organized with clear and smooth transitions, easy to follow.
Score 5: Excellently organized with flawless logical flow and seamless transitions.

###Feedback:""".strip()

logical_robustness_evaluator = PrometheusLLMEvaluator(
    template=logical_robustness_prompt_template,
    generator=generator,
    inputs=[("query", List[str]), ("generated_answer", List[str])],
)
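
The three templates above share most of their text and differ only in the instruction, the rubric and the optional reference answer. As a sketch of how that repetition could be factored out, the hypothetical helper `build_template` below assembles a Jinja template string from those parts. It is an illustration under our own assumptions, not part of Haystack or of the prometheus-eval library; the fixed sentences are copied from the templates above.

```python
from typing import List

TASK_WITH_REFERENCE = (
    "An instruction (might include an Input inside it), a response to evaluate, "
    "a reference answer that gets a score of 5, and a score rubric representing "
    "a evaluation criteria are given."
)
TASK_WITHOUT_REFERENCE = (
    "An instruction (might include an Input inside it), a response to evaluate, "
    "and a score rubric representing a evaluation criteria are given."
)
COMMON_STEPS = (
    "1. Write a detailed feedback that assess the quality of the response strictly "
    "based on the given score rubric, not evaluating in general.\n"
    "2. After writing a feedback, write a score that is an integer between 1 and 5. "
    "You should refer to the score rubric.\n"
    '3. The output format should look as follows: "Feedback: (write a feedback for '
    'criteria) [RESULT] (an integer number between 1 and 5)"\n'
    "4. Please do not generate any other opening, closing, and explanations."
)


def build_template(instruction: str, score_descriptions: List[str], use_reference: bool = False) -> str:
    """Assemble a Prometheus-style Jinja template; `instruction` may contain {{query}}."""
    if len(score_descriptions) != 5:
        raise ValueError("expected one description for each score from 1 to 5")
    parts = [
        "###Task Description:",
        TASK_WITH_REFERENCE if use_reference else TASK_WITHOUT_REFERENCE,
        COMMON_STEPS,
        "",
        "###The instruction to evaluate:",
        instruction,
        "",
        "###Response to evaluate:",
        "generated answer: {{generated_answer}}",
        "",
    ]
    if use_reference:
        parts += ["###Reference Answer (Score 5): {{reference_answer}}", ""]
    parts.append("###Score Rubrics:")
    parts += [f"Score {i}: {text}" for i, text in enumerate(score_descriptions, start=1)]
    parts += ["", "###Feedback:"]
    return "\n".join(parts)


# Invented rubric, for illustration only.
rubric = ["very poor", "poor", "acceptable", "good", "excellent"]
with_reference = build_template("Evaluate the answer to: {{query}}", rubric, use_reference=True)
without_reference = build_template("Evaluate the answer to: {{query}}", rubric)
print(without_reference)
```

The returned string could then be passed as the `template` argument of an evaluator, with `inputs` listing `reference_answer` only when `use_reference=True`.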

Let’s try the logical_robustness_evaluator:

query = [
    "Are group 2 innate lymphoid cells ( ILC2s ) increased in chronic rhinosinusitis with nasal polyps or eosinophilia?",
    "Does poor sleep predict symptoms of depression and disability retirement due to depression?",
]
generated_answer = [
    "As ILC2s are elevated in patients with CRSwNP, they may drive nasal polyp formation in CRS. ILC2s are also linked with high tissue and blood eosinophilia and have a potential role in the activation and survival of eosinophils during the Th2 immune response. The association of innate lymphoid cells in CRS provides insights into its pathogenesis.",
    "Lack of baseline diagnostic interviews; sleep quality based on self-report.",
]


res = logical_robustness_evaluator.run(query=query, generated_answer=generated_answer)
res
{'score': 3.0,
 'individual_scores': [5, 1],
 'feedbacks': ["The generated response is well-organized and presents a clear progression of ideas. It starts by establishing a link between ILC2s and CRSwNP, then describes the role of ILC2s in nasal polyps formation and eosinophilia. The response then draws a conclusion about the pathogenesis of CRS, which is a coherent and logical flow of information. Each sentence builds on the previous, ensuring that the reader is able to follow the argument without confusion. The response maintains a consistent structure and makes smooth transitions between the different points, making it easy to follow. The logical flow and seamless transitions indicate a high level of organization, which aligns well with the score rubric's criteria for a score of 5. Therefore, the response is of high quality in terms of logical organization.",
  'The response provided does not follow the logical structure expected as per the score rubric. There is a lack of clear organization and progression of ideas. The statement is abrupt and does not flow into a logical argument or question, making it difficult to follow the reasoning behind it. It fails to establish a connection between poor sleep, symptoms of depression, and disability retirement due to depression, which is the main focus of the question. The lack of a clear progression of ideas and arguments, and the absence of smooth transitions, makes it challenging to follow the response. Thus, the response fails to meet the criteria for a well-organized and logically flowing answer. Therefore, based on the score rubric, the response is disorganized and lacks a clear structure, making it difficult to follow. So the overall score is 1.']}

Ok, nice!
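
The aggregate score in this output is simply the mean of the individual scores: (5 + 1) / 2 = 3.0. The sketch below reproduces that arithmetic with a hypothetical helper, `aggregate_score`, which also skips missing scores and returns NaN when nothing could be parsed. Treating missing scores as `None` entries is our assumption here; the component above drops them from the list instead.

```python
import math
from statistics import mean
from typing import List, Optional


def aggregate_score(individual_scores: List[Optional[int]]) -> float:
    """Mean of the available scores; NaN if there are none."""
    valid = [score for score in individual_scores if score is not None]
    return float(mean(valid)) if valid else math.nan


print(aggregate_score([5, 1]))
print(aggregate_score([4, None, 5]))
```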

Evaluation pipeline

We can now add our evaluators to an Evaluation pipeline and run the pipeline with our RAG results.

from haystack import Pipeline

eval_pipeline = Pipeline()
eval_pipeline.add_component("correctness_evaluator", correctness_evaluator)
eval_pipeline.add_component("response_relevance_evaluator", response_relevance_evaluator)
eval_pipeline.add_component("logical_robustness_evaluator", logical_robustness_evaluator)

Let’s download the RAG results. If you have run the RAG pipeline, you can skip the next cell.

# skip this cell if you have run the RAG pipeline before

!wget "https://raw.githubusercontent.com/deepset-ai/haystack-cookbook/main/data/prometheus2_evaluation/gemma_2b_rag_results.json"
import json

with open("gemma_2b_rag_results.json", "r") as fin:
    rag_results = json.load(fin)

questions = rag_results["questions"]
ground_truth_answers = rag_results["ground_truth_answers"]
rag_answers = rag_results["rag_answers"]
eval_results = eval_pipeline.run(
    {
        "correctness_evaluator": {
            "query": questions,
            "generated_answer": rag_answers,
            "reference_answer": ground_truth_answers,
        },
        "response_relevance_evaluator": {
            "query": questions,
            "generated_answer": rag_answers,
        },
        "logical_robustness_evaluator": {
            "query": questions,
            "generated_answer": rag_answers,
        },
    }
)

Evaluation results

Once we’ve run our evaluation pipeline, we can also create a full evaluation report. Haystack provides an EvaluationRunResult which we can use to display a score_report.

from haystack.evaluation.eval_run_result import EvaluationRunResult

inputs = {
    "question": questions,
    "answer": ground_truth_answers,
    "predicted_answer": rag_answers,
}

evaluation_result = EvaluationRunResult(run_name="pubmed_rag_pipeline", inputs=inputs, results=eval_results)
evaluation_result.score_report()
                              score
correctness_evaluator           3.9
response_relevance_evaluator    4.3
logical_robustness_evaluator    3.5

In our small sample, Gemma-1.1-2b-it seems to generate generally relevant answers (4.3), but the responses deviate from the ground truth answers (3.9) and their logical organization is not optimal (3.5).
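
Aggregate scores hide which examples went wrong. A simple way to find them is to pair each question with its score and feedback and keep the low-scoring ones. The following is a minimal sketch with a hypothetical helper, `low_scoring_examples`, and placeholder values. It assumes the three lists are aligned, which only holds when every model reply contained a valid score.

```python
from typing import Dict, List, Union


def low_scoring_examples(
    questions: List[str],
    individual_scores: List[int],
    feedbacks: List[str],
    threshold: int = 2,
) -> List[Dict[str, Union[str, int]]]:
    """Return the examples whose score is at or below the threshold."""
    if not (len(questions) == len(individual_scores) == len(feedbacks)):
        raise ValueError("questions, scores and feedbacks must be aligned")
    return [
        {"question": question, "score": score, "feedback": feedback}
        for question, score, feedback in zip(questions, individual_scores, feedbacks)
        if score <= threshold
    ]


# Placeholder values (hypothetical), not results from the run above.
flagged = low_scoring_examples(
    questions=["question 1", "question 2", "question 3"],
    individual_scores=[5, 2, 1],
    feedbacks=["feedback 1", "feedback 2", "feedback 3"],
)
for example in flagged:
    print(example)
```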

Let’s inspect the specific metrics in a dataframe.

import pandas as pd

# do not truncate text
pd.set_option("display.max_colwidth", None)

results_df = evaluation_result.to_pandas()
results_df
                                                                                                                                             question  \
0                                                                    Is cDK1 and CDK2 activity a strong predictor of renal cell carcinoma recurrence?   
1           Does metabolic control analysis of the Trypanosoma cruzi peroxide detoxification pathway identify tryparedoxin as a suitable drug target?   
2                            Does promoter variant rs2301228 on the neural cell adhesion molecule 1 gene confer risk of schizophrenia in Han Chinese?   
3                             Does pancreatic polypeptide regulate glucagon release through PPYR1 receptors expressed in mouse and human alpha-cells?   
4                                   Does tetraploid complementation prove pluripotency of induced pluripotent stem cells derived from adipose tissue?   
5  Is osteoprotegerin associated with subclinical left ventricular systolic dysfunction in diabetic hypertensive patients : a speckle tracking study?   
6                                          Is cD30 expression a novel prognostic indicator in extranodal natural killer/T-cell lymphoma , nasal type?   
7                                                        Does mild cognitive dysfunction affect diabetes mellitus control in minority elderly adults?   
8                                                                  Do youth walking and biking rates vary by environments around 5 Louisiana schools?   
9                                        Are human enteroviruses the cause of neurological impairments in children at the Korle-Bu Teaching Hospital?   

                                                                                                                                                                                                                                                                                                                                                                            answer  \
0                                                                                                                                                                                                                                                                                               CDK1SA of tumors and the CDK2SA are both associated with recurrence and prognosis.   
1                                                                                              These quantitative kinetic and metabolic analyses pointed out to TXN as a convenient drug target due to its low catalytic efficiency, high control on the flux of peroxide detoxification and role as provider of reducing equivalents to the two main peroxidases in the parasite.   
2                                                                                                                                                                 Our results provide direct evidence for NCAM1 as a susceptibility gene for schizophrenia, which offers support to a neurodevelopmental model and neuronal connectivity hypothesis in the onset of schizophrenia.   
3                                                                                                                                                                                                       Glucose stimulates PP secretion and PP inhibits glucagon release in mouse pancreatic islets. PP receptors are present in alpha-cells of mouse and human pancreatic islets.   
4                                                                                                                                                                                          We also directed differentiation of iPS cells into chondrocytes, thus adipose-derived iPS cells can be used as models to study chondrogenic differentiation and cartilage regeneration.   
5                                                                                                                                                                                                                                                                           Plasma OPG values could predict subclinical LV systolic dysfunction in diabetic hypertensive patients.   
6                                 Our results showed that expression of CD30 was not related to response to treatment but was an independent prognostic factor for both OS and PFS in ENKTL, nasal type, which suggests a role for CD30 in the pathogenesis of this disease and may support the incorporation of anti-CD30-targeted therapy into the treatment paradigm for ENKTL.   
7  Cognitive dysfunction in the mild range did not seem to affect diabetes mellitus control parameters in this multiethnic cohort of older adults with diabetes mellitus, although it cannot be excluded that cognitive impairment was overcome through assistance from formal or informal caregivers. It is possible that more-severe cognitive dysfunction could affect control.   
8                                                                                                                                                                                                                                                       There is a role and a need for the SRTS program. The environment and other factors influence biking and walking to school.   
9                                                                                                                                           Non-polio enteroviruses and chikugunya virus were found not to be the etiologic agent responsible for the convulsion with neurologic morbidity observed in the Ghanaian children. Investigation for other viral agents is recommended.   

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                         predicted_answer  \
0                                                                                                                                                                                                                                                                                                                                             **Yes, cDK1 and CDK2 activity is a strong predictor of renal cell carcinoma recurrence.**\n\nThe study found that tumors with low CDK2SA-CDK1SA ratio showed significantly better 5-year recurrence-free survival than those with high CDK2SA-CDK1SA ratio.   
1                                                                                                                                                                                                                                                                                                                                                                   **Yes.**\n\nMetabolic control analysis of the Trypanosoma cruzi peroxide detoxification pathway identified tryparedoxin as a suitable drug target, with 90-100% of the pathway flux controlled by TXN-TXNPx and TXN-GPxA redox pairs.   
2                                                                                                                                                                                                                                                                                                                                                                                                                                                                      **Yes.**\n\nThe study found that the promoter variant rs2301228 on the NCAM1 gene is associated with schizophrenia in Han Chinese.   
3                                                                                                                                                                                                                                                                                                                                                             **Yes.**\n\nThe context information indicates that PPYR1 receptors are expressed in the alpha-cell population of mouse and human islets, and that PP release increases upon glucose stimulation, leading to inhibition of glucagon release.   
4                                                                                                                                                                                                                                                                                                                                 **Yes.**\n\nThe study shows that tetraploid complementation successfully generated viable and fertile adult mice derived entirely from reprogrammed ASC, indicating that cell types other than fibroblasts can also be restored to the embryonic level of pluripotency.   
5   **Yes**, osteoprotegerin (OPG) is associated with subclinical left ventricular systolic dysfunction in diabetic hypertensive patients.\n\n**Results:**\n\n* Diabetic hypertensive patients had higher diastolic peak early/early diastolic tissue velocity and lower systolic tissue velocity, GLS, GLS rate systolic, and GLS rate early diastolic than nondiabetic hypertensive patients.\n* OPG was found to be an independent predictor of impaired GLS.\n* Receiver operating characteristic curve analysis revealed that OPG values of > 6.45 (pmol/L) identified the patients with GLS ≤ 18.5.   
6                                                                                                                                                                                                                                                                                                                        **Yes**, CD30 expression is a novel prognostic indicator in extranodal natural killer/T-cell lymphoma, nasal type.\n\nThe study found that CD30 positivity was associated with shorter 5-year OS and PFS in patients with extranodal natural killer/T-cell lymphoma, nasal type.   
7                                                                                                                                                                                                                                                                                                                                                                         The provided text does not contain any information regarding whether mild cognitive dysfunction affects diabetes mellitus control in minority elderly adults, so I am unable to answer this question from the provided context.   
8                                                                                                                                                                                                                                                                                                             The provided text indicates that there is a significant difference in the walking and biking rates between the 5 Louisiana schools. More students walked than biked to school, and the sites with the highest PEDS score had the highest percentage of students who walked/biked to school.   
9                                                                                                                                                                                                                                                                                                                                 The provided text suggests that enteroviruses were detected in cerebrospinal fluid (CSF) samples from children at the Korle-Bu Teaching Hospital, but further studies are needed to establish a causal relationship between enteroviruses and neurological impairments.   

   correctness_evaluator  response_relevance_evaluator  \
0                      4                             3   
1                      5                             5   
2                      3                             5   
3                      5                             4   
4                      4                             5   
5                      5                             5   
6                      5                             5   
7                      1                             1   
8                      3                             5   
9                      4                             5   

   logical_robustness_evaluator  
0                             2  
1                             4  
2                             1  
3                             5  
4                             5  
5                             5  
6                             4  
7                             1  
8                             4  
9                             4  
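
The aggregate values in the score report are consistent with a plain mean of the per-question scores. As a quick sanity check, the sketch below recomputes them from the scores shown in the dataframe above. The `individual_scores` dictionary is typed in by hand for illustration; it is not read from `eval_results`.

```python
from statistics import mean

# Per-question scores copied by hand from the dataframe above (rows 0-9)
individual_scores = {
    "correctness_evaluator": [4, 5, 3, 5, 4, 5, 5, 1, 3, 4],
    "response_relevance_evaluator": [3, 5, 5, 4, 5, 5, 5, 1, 5, 5],
    "logical_robustness_evaluator": [2, 4, 1, 5, 5, 5, 4, 1, 4, 4],
}

# Average each evaluator's scores over the 10 questions
aggregate_scores = {name: mean(scores) for name, scores in individual_scores.items()}

for name, score in aggregate_scores.items():
    print(f"{name}: {score:.1f}")
```

Question 7, where the model declined to answer, receives a score of 1 from all three evaluators and therefore pulls each average down.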

Since Prometheus provides feedback for each evaluation, it is worth taking a look at it.

eval_results["logical_robustness_evaluator"]["feedbacks"]
['The generated answer, while accurate, does not exhibit a strong logical organization. It simply states the conclusion without a detailed explanation of the underlying data or the process that led to this conclusion. Furthermore, there are no transition phrases or linking sentences that would guide the reader from one point to the next, making it hard to follow the progression of ideas.\n\nDespite the absence of transition phrases or linking sentences, the answer maintains a certain degree of coherence, but this coherence could be greatly improved by providing more context or by elaborating on the reasons behind the observed relationship between cDK1 and CDK2 activity and renal cell carcinoma recurrence. For example, it could explain why a lower CDK2SA-CDK1SA ratio is associated with better survival outcomes.\n\nTherefore, although the response contains the necessary information, it lacks the clear progression of ideas and arguments that would make it easy to follow. In contrast, a response with excellent organization would include detailed explanations, smoothly transitioning from one point to the next, and a clear progression of ideas. The absence of these elements in the response means that it falls short of the expected standard of logical organization and flow. \n\nSo the overall score is 2.',
 "This response provides a concise answer to the question, effectively stating that metabolic control analysis identified tryparedoxin as a suitable drug target. It succinctly describes the pathway's regulation by the redox pairs TXN-TXNPx and TXN-GPxA, which demonstrates the clear flow of information and aligns well with the expected logical structure of the response.\n\nHowever, while this response is accurate and follows a logical progression, it lacks the detail found in more elaborate answers. For instance, it does not explicitly mention the percentage of pathway flux controlled by these redox pairs, which could have added more depth to the answer. Moreover, the explanation could be further refined to improve the clarity of the connections between the different components.\n\nDespite these minor drawbacks, the response maintains a well-organized structure and smooth transitions, making it easy to follow. The information is presented in a logical sequence, which helps to enhance the overall coherence of the answer.\n\nIn light of the criteria outlined in the score rubric, the response fulfills the expectations for a score of 4. It presents the information in a logical, coherent, and well-structured manner, although there is room for improvement in terms of detail and connection clarity.\n\nSo the overall score is 4.",
 'This response is disorganized and lacks clear structure. It does not provide any details or reasoning behind its claim. The transition from presenting the study to confirming the link between the promoter variant and schizophrenia is abrupt and lacks any logical flow. The reader is left without any explanation or understanding of how the study reached its conclusion, making it difficult to follow. This failure to elaborate or substantiate the claim results in a response that does not meet the required standards for logical organization. Thus, it can be concluded that this response falls short in fulfilling the criteria outlined in the score rubric.',
 "This response succinctly affirms the question, with a clear structure that logically follows from the contextual information provided. The connection between the increase in PP release and the subsequent inhibition of glucagon release is presented in a logical sequence that's easy to understand. There are no abrupt transitions or unclear connections in the response, ensuring a smooth flow from one point to another. This response effectively demonstrates a coherent and seamless logical progression of ideas. As per the scoring rubric, it shows that the answer is not only well-organized but also has clear and smooth transitions. Therefore, it adheres to the criteria of being easy to follow and exhibiting a flawless logical flow, hence it is awarded a score of 5.",
 'The response provided has shown an excellent logical flow, which aligns with the requirements of the score rubric. The answer directly addresses the question, presenting a clear and well-structured argument. It starts with an affirmation of the initial question, then elaborates on the process of tetraploid complementation, explaining the implications in terms of the pluripotency of the induced pluripotent stem cells. The transition from the premise to the conclusion is seamless, making it easy for the reader to follow the logic. There are no abrupt transitions or disorganized elements in the response, which further contributes to its overall clarity and coherence. So, the response fully meets the criteria of a score 5, as it is excellently organized with flawless logical flow and seamless transitions.',
 'The generated answer demonstrates an excellent logical flow and seamless transitions between the information provided, which aligns with the highest score of the rubric. It effectively establishes the connection between osteoprotegerin (OPG) and subclinical left ventricular systolic dysfunction in diabetic hypertensive patients. The response succinctly presents the results of the speckle tracking study and clearly defines how OPG is an independent predictor of impaired GLS. The conclusion drawn from the receiver operating characteristic curve analysis reinforces the connection between OPG levels and the identification of patients with GLS ≤ 18.5. The organization of the response is logical and clear, making it easy for readers to follow the line of reasoning from the introduction to the conclusion. Therefore, according to the score rubric, the response is well-structured and offers an in-depth and coherent understanding of the topic. So the overall score is 5.',
 'When evaluating the organization of the response, the primary concern is the clarity and smoothness of the progression of ideas. In this case, the answer is well-structured with a clear statement followed by supporting evidence from the study. The transition from stating the conclusion ("Yes, CD30 expression is a novel prognostic indicator") to presenting the study\'s findings is smooth and logical.\n\nHowever, there is room for improvement in terms of providing more context to the initial statement. By mentioning what the study found in relation to the 5-year OS and PFS, the answer could have provided a more thorough explanation that directly relates to the question. The connection between the initial statement and the supporting evidence is clear but could benefit from a more explicit explanation.\n\nDespite these minor areas for improvement, the response does a good job at presenting the argument in a coherent manner. Therefore, according to the score rubric, which emphasizes the clear progression of ideas and arguments, this response meets the requirements for a score of 4. The overall structure is sound, but a slightly more detailed presentation of the evidence would have elevated it to a perfect score.',
 "Upon reviewing the generated response, it is evident that there is a lack of content that directly addresses the posed question. The response fails to provide any argument or information related to the relationship between mild cognitive dysfunction and diabetes mellitus control in minority elderly adults. The text's structure is disorganized, as it merely states the inability to answer, without any attempt to explore the question or provide a logical flow of information. This makes it very difficult for the reader to follow or understand the content. Consequently, according to the score rubric, this response would be evaluated as having a score of 1, as it is disorganized, lacks clear structure, and is difficult to follow.",
 'This response presents a straightforward statement that walks through the central point in a linear fashion. The progression of ideas is logical and easy to follow, as it moves from indicating a difference in rates to specifying the relationship between these rates and the PEDS score. However, the response does not provide the depth of analysis that could have made the argument more robust. For example, it does not delve into why this significant difference might exist or consider any potential variables that could affect these rates. The logical flow and clarity of the response meet the requirements of a score of 4, but it falls short of achieving a score of 5 due to the absence of more detailed explanations or comparisons. Therefore, while the response is generally organized, it could benefit from further elaboration and a more comprehensive analysis of the data. So the overall score is 4.',
 "The generated response presents a clear and structured argument, aligning with the scoring rubric's criteria for a score of 4. The response successfully establishes the presence of enteroviruses in CSF samples, acknowledging the need for more research to definitively link these viruses to neurological impairments. The argument flows logically, from acknowledging the initial findings to suggesting the necessity of additional studies. This structure, along with the smooth transitions between ideas, facilitates easy comprehension, which is a critical aspect as per the score rubric. However, the response could be further enhanced by providing a bit more context or detail about the research process or the specific types of neurological impairments associated with the virus, which might elevate it to a score of 5. Nevertheless, the response does not present abrupt transitions, nor does it contain unclear connections, which are key factors negatively impacting the scoring."]

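Reading every feedback in full gets tedious as the sample grows. One option is to pair each feedback with its score and question and look only at the low-scoring cases. The sketch below is a minimal illustration on toy data with shortened texts. The helper `low_score_feedback` and its `threshold` parameter are hypothetical. In the notebook, the inputs would come from `questions` and from the evaluator output, assuming it exposes per-question scores (for example under a key such as `individual_scores`) alongside `feedbacks`.

```python
def low_score_feedback(questions, scores, feedbacks, threshold=2):
    """Return (question, score, feedback) triples whose score is <= threshold."""
    return [
        (question, score, feedback)
        for question, score, feedback in zip(questions, scores, feedbacks)
        if score <= threshold
    ]


# Toy data: shortened questions and feedbacks, for illustration only
sample_questions = [
    "Is cDK1 and CDK2 activity a strong predictor of recurrence?",
    "Does pancreatic polypeptide regulate glucagon release?",
    "Does promoter variant rs2301228 confer risk of schizophrenia?",
]
sample_scores = [2, 5, 1]
sample_feedbacks = [
    "Accurate, but lacks a clear progression of ideas.",
    "Well organized, with smooth transitions.",
    "Disorganized and lacks clear structure.",
]

flagged = low_score_feedback(sample_questions, sample_scores, sample_feedbacks)
for question, score, feedback in flagged:
    print(f"[{score}] {question}\n    {feedback}")
```

A threshold of 2 is an arbitrary choice here; on a 1-5 rubric, it keeps only the answers the evaluator considers clearly deficient.
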
📚 References