Tutorial: Build an Extractive QA Pipeline
Last Updated: January 15, 2025
- Level: Beginner
- Time to complete: 15 minutes
- Components Used: ExtractiveReader, InMemoryDocumentStore, InMemoryEmbeddingRetriever, DocumentWriter, SentenceTransformersDocumentEmbedder, SentenceTransformersTextEmbedder
- Goal: After completing this tutorial, you’ll have learned how to build a Haystack pipeline that uses an extractive model to display where the answer to your query is.
This tutorial uses the latest version of Haystack 2.x (haystack-ai). For more information on Haystack 2.0, read the Haystack 2.0 announcement or visit the Haystack Documentation.
Overview
What is extractive question answering? So glad you asked! The short answer is that extractive models pull verbatim answers out of text. They’re a good fit for use cases where accuracy is paramount and you need to know exactly where in the text the answer came from. If you want additional context, here’s a deep dive on extractive versus generative language models.
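To make "verbatim" concrete, here is a minimal sketch in plain Python rather than Haystack code. It assumes an answer can be described by hypothetical `start` and `end` character offsets into the source text; the sentence is borrowed from the example output shown later in this tutorial.

```python
# Toy illustration of an extractive answer: a verbatim span of the source text.
# `start` and `end` are hypothetical character offsets, not Haystack fields.
context = "The Roman writer Pliny the Elder, writing in the first century AD"

answer_text = "Roman writer"
start = context.find(answer_text)  # locate the span in the source text
end = start + len(answer_text)

# The answer can be rebuilt from the offsets alone, so it can always be
# traced back to an exact position in the document.
extracted = context[start:end]
print(start, end, extracted)
```

Because the answer is a slice of the original text, nothing in it is paraphrased or generated.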
In this tutorial you’ll create a Haystack pipeline that extracts answers to questions, based on the provided documents.
To get data into the extractive pipeline, you’ll also build an indexing pipeline to ingest the Wikipedia pages from the Seven Wonders of the Ancient World dataset.
Preparing the Colab Environment
#Installation
%%bash
pip install haystack-ai accelerate "sentence-transformers>=3.0.0" "datasets>=2.6.1"
Knowing you’re using this tutorial helps us decide where to invest our efforts to build a better product, but you can always opt out by commenting out the following line. See Telemetry for more details.
from haystack.telemetry import tutorial_running
tutorial_running(34)
Load data into the DocumentStore
Before you can use this data in the extractive pipeline, you’ll use an indexing pipeline to fetch it, process it, and load it into the document store.
The data has already been cleaned and preprocessed, so turning it into Haystack Documents is fairly straightforward.
Using an InMemoryDocumentStore here keeps things simple. However, this general approach would work with any document store that Haystack 2.0 supports.
The SentenceTransformersDocumentEmbedder transforms each Document into a vector. Here we’ve used sentence-transformers/multi-qa-mpnet-base-dot-v1. You can substitute any embedding model you like, as long as you use the same one in your extractive pipeline.
Lastly, the DocumentWriter writes the vectorized documents to the DocumentStore.
from datasets import load_dataset
from haystack import Document
from haystack import Pipeline
from haystack.document_stores.in_memory import InMemoryDocumentStore
from haystack.components.retrievers.in_memory import InMemoryEmbeddingRetriever
from haystack.components.readers import ExtractiveReader
from haystack.components.embedders import SentenceTransformersDocumentEmbedder
from haystack.components.writers import DocumentWriter
dataset = load_dataset("bilgeyucel/seven-wonders", split="train")
documents = [Document(content=doc["content"], meta=doc["meta"]) for doc in dataset]
model = "sentence-transformers/multi-qa-mpnet-base-dot-v1"
document_store = InMemoryDocumentStore()
indexing_pipeline = Pipeline()
indexing_pipeline.add_component(instance=SentenceTransformersDocumentEmbedder(model=model), name="embedder")
indexing_pipeline.add_component(instance=DocumentWriter(document_store=document_store), name="writer")
indexing_pipeline.connect("embedder.documents", "writer.documents")
indexing_pipeline.run({"documents": documents})
Build an Extractive QA Pipeline
Your extractive QA pipeline will consist of three components: an embedder, a retriever, and a reader.
- The SentenceTransformersTextEmbedder turns a query into a vector, using the same embedding model defined above.
- Vector search allows the retriever to efficiently return relevant documents from the document store. Retrievers are tightly coupled with document stores; thus, you’ll use an InMemoryEmbeddingRetriever to go with the InMemoryDocumentStore.
- The ExtractiveReader returns answers to that query, as well as their location in the source document, and a confidence score.
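To build intuition for the retrieval step, here is a toy sketch of vector search in plain Python. It is not how Haystack implements retrieval: the three-dimensional vectors, the document ids, and the `dot` and `retrieve` helpers are all made up for illustration, and dot-product scoring is an assumption chosen to keep the example short.

```python
# Toy vector search in plain Python (no Haystack, no real embedding model).
# All vectors and ids below are hypothetical and exist only for illustration.

def dot(a, b):
    """Dot-product similarity between two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))

def retrieve(query_embedding, store, top_k=2):
    """Return the ids of the top_k documents ranked by dot-product score."""
    ranked = sorted(
        store,
        key=lambda doc_id: dot(query_embedding, store[doc_id]),
        reverse=True,
    )
    return ranked[:top_k]

# Hypothetical document embeddings, keyed by a made-up document id.
toy_store = {
    "pyramid": [0.9, 0.1, 0.0],
    "lighthouse": [0.1, 0.8, 0.2],
    "colossus": [0.0, 0.2, 0.9],
}

# A made-up query embedding that points mostly along the first dimension.
toy_query = [1.0, 0.2, 0.0]

top_ids = retrieve(toy_query, toy_store, top_k=2)
print(top_ids)
```

The comparison only makes sense when the query vector and the document vectors live in the same space, which is why the query embedder must use the same model as the document embedder.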
from haystack.components.retrievers.in_memory import InMemoryEmbeddingRetriever
from haystack.components.readers import ExtractiveReader
from haystack.components.embedders import SentenceTransformersTextEmbedder
retriever = InMemoryEmbeddingRetriever(document_store=document_store)
reader = ExtractiveReader()
reader.warm_up()
extractive_qa_pipeline = Pipeline()
extractive_qa_pipeline.add_component(instance=SentenceTransformersTextEmbedder(model=model), name="embedder")
extractive_qa_pipeline.add_component(instance=retriever, name="retriever")
extractive_qa_pipeline.add_component(instance=reader, name="reader")
extractive_qa_pipeline.connect("embedder.embedding", "retriever.query_embedding")
extractive_qa_pipeline.connect("retriever.documents", "reader.documents")
Try extracting some answers.
query = "Who was Pliny the Elder?"
extractive_qa_pipeline.run(
data={"embedder": {"text": query}, "retriever": {"top_k": 3}, "reader": {"query": query, "top_k": 2}}
)
ExtractiveReader: a closer look
Here’s an example answer:
[ExtractedAnswer(query='Who was Pliny the Elder?', score=0.8306006193161011, data='Roman writer', document=Document(id=bb2c5f3d2e2e2bf28d599c7b686ab47ba10fbc13c07279e612d8632af81e5d71, content: 'The Roman writer Pliny the Elder, writing in the first century AD, argued that the Great Pyramid had...', meta: {'url': 'https://en.wikipedia.org/wiki/Great_Pyramid_of_Giza', '_split_id': 16}
The confidence score ranges from 0 to 1. Higher scores mean the model has more confidence in the answer’s relevance.
The Reader sorts the answers based on their probability scores, with higher probability listed first. You can limit the number of answers the Reader returns with the optional top_k parameter.
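The sorting and truncation described above can be sketched in a few lines of plain Python. `ToyAnswer` and `top_answers` are hypothetical stand-ins rather than Haystack classes or functions, and the scores are invented for the example.

```python
from dataclasses import dataclass

@dataclass
class ToyAnswer:
    """Hypothetical stand-in for an extracted answer (not Haystack's class)."""
    data: str
    score: float

def top_answers(answers, top_k):
    """Sort answers by descending score and keep the first top_k."""
    return sorted(answers, key=lambda a: a.score, reverse=True)[:top_k]

# Made-up candidate answers with made-up scores.
candidates = [
    ToyAnswer("first century AD", 0.41),
    ToyAnswer("Roman writer", 0.83),
    ToyAnswer("Great Pyramid", 0.12),
]

best = top_answers(candidates, top_k=2)
print([a.data for a in best])
```

A smaller top_k keeps only the most confident answers, while a larger one surfaces lower-scoring candidates as well.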
By default, the Reader sets no_answer=True. With this setting, the Reader also returns an ExtractedAnswer with no text, whose score is the probability that none of the returned answers are correct.
ExtractedAnswer(query='Who was Pliny the Elder?', score=0.04606167031102615, data=None, document=None, context=None, document_offset=None, context_offset=None, meta={})]}}
Here, the score of 0.04606167031102615 means the model is fairly confident the provided answers are correct in this case. You can disable this behavior and return only answers by setting the no_answer param to False when initializing your ExtractiveReader.
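One way to reason about a "none of the answers are correct" score is sketched below. It assumes each answer's score can be read as an independent probability of being correct; that is an illustrative assumption, not a statement of how the Reader computes its value, and the two scores are made up.

```python
import math

# Made-up confidence scores for two returned answers.
answer_scores = [0.83, 0.73]

# Assumption: each score is an independent probability of being correct.
# Then the chance that every answer is wrong is the product of (1 - score).
no_answer_score = math.prod(1.0 - s for s in answer_scores)
print(round(no_answer_score, 4))
```

Under this reading, a low no-answer score means at least one of the returned answers is likely correct, which matches the interpretation given above.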
Wrapping it up
If you’ve been following along, now you know how to build an extractive question answering pipeline with Haystack 2.0. Thanks for reading!
If you liked this tutorial, there’s more to learn about Haystack 2.0:
- Classifying Documents & Queries by Language
- Generating Structured Output with Loop-Based Auto-Correction
- Preprocessing Different File Types
To stay up to date on the latest Haystack developments, you can sign up for our newsletter.