Tutorial: Query Classifier


This tutorial is based on Haystack 1.x (farm-haystack). If you’re using Haystack 2.x (haystack-ai), refer to the Haystack 2.x tutorials or Haystack Cookbook.

For more information on Haystack 2.x, read the Haystack 2.0 announcement.

  • Level: Intermediate
  • Time to complete: 15 minutes
  • Nodes Used: TransformersQueryClassifier, InMemoryDocumentStore, BM25Retriever, EmbeddingRetriever, FARMReader
  • Goal: After completing this tutorial, you will have learned about TransformersQueryClassifier and how to use it in a pipeline.

Overview

One of the great benefits of using state-of-the-art NLP models like those available in Haystack is that it allows users to state their queries as plain natural language questions: rather than trying to come up with just the right set of keywords to find the answer to their question, users can simply ask their question in much the same way that they would ask it of a (very knowledgeable!) person.

But just because users can ask their questions in “plain English” (or “plain German”, etc.), that doesn’t mean they always will. For instance, users might input a few keywords rather than a complete question because they don’t understand the pipeline’s full capabilities or are simply accustomed to keyword search. While a standard Haystack pipeline might handle such queries with reasonable accuracy, you may still want the pipeline to be sensitive to the type of query it receives, so that it behaves differently when a user inputs, say, a collection of keywords instead of a question. For this reason, Haystack comes with built-in capabilities to distinguish between three types of queries: keyword queries, interrogative queries (questions), and statement queries.

In this tutorial you will learn how to use TransformersQueryClassifier to branch your Haystack pipeline based on the type of query it receives. Haystack comes with two out-of-the-box query classification schemas:

  1. Keyword vs. Question/Statement: routes a query into one of two branches depending on whether it is a full question/statement or a collection of keywords.

  2. Question vs. Statement: routes a natural language query into one of two branches depending on whether it is a question or a statement.

With TransformersQueryClassifier, it’s also possible to route queries based on custom cases, using custom classification models or zero-shot classification.

With all of that explanation out of the way, let’s dive in!

Preparing the Colab Environment

Installing Haystack

To start, install the latest release of Haystack with pip:

%%bash

pip install --upgrade pip
pip install farm-haystack[colab,elasticsearch,inference]

# Install these to allow pipeline visualization
# apt install libgraphviz-dev
# pip install pygraphviz

Enabling Telemetry

Knowing you’re using this tutorial helps us decide where to invest our efforts to build a better product, but you can always opt out by commenting out the following line. See Telemetry for more details.

from haystack.telemetry import tutorial_running

tutorial_running(14)

Logging

We configure how logging messages should be displayed and which log level should be used before importing Haystack. Example log message:

INFO - haystack.utils.preprocessing - Converting data/tutorial1/218_Olenna_Tyrell.txt

The default log level in basicConfig is WARNING, so the explicit parameter is not necessary, but it makes the level easy to change:

import logging

logging.basicConfig(format="%(levelname)s - %(name)s -  %(message)s", level=logging.WARNING)
logging.getLogger("haystack").setLevel(logging.INFO)
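To see what the format string above produces, here is a minimal sketch that renders one message into an in-memory buffer. It uses only the Python standard library; the logger name demo.preprocessing and the message text are made up for illustration and are not part of Haystack.

```python
import io
import logging

# Capture log output in memory so the rendered line can be inspected.
stream = io.StringIO()
handler = logging.StreamHandler(stream)
handler.setFormatter(logging.Formatter("%(levelname)s - %(name)s -  %(message)s"))

# Hypothetical logger name, used only for this demonstration.
demo_logger = logging.getLogger("demo.preprocessing")
demo_logger.addHandler(handler)
demo_logger.setLevel(logging.INFO)
demo_logger.propagate = False  # keep the demo out of the root logger's handlers

demo_logger.info("Converting example.txt")
demo_logger.debug("Filtered out because the logger level is INFO")

rendered = stream.getvalue()
print(rendered, end="")
```

Only the INFO message reaches the buffer. Raising or lowering the level passed to setLevel changes which messages get through, which is the same knob the tutorial turns for the "haystack" logger.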

Trying out TransformersQueryClassifier

Before integrating a query classifier into a pipeline, test it on its own to see what it does. First, initialize a simple, out-of-the-box keyword vs. question/statement TransformersQueryClassifier. By default, it uses the shahrukhx01/bert-mini-finetune-question-detection model.

from haystack.nodes import TransformersQueryClassifier

keyword_classifier = TransformersQueryClassifier()

Now feed some queries into this query classifier. Test with one keyword query, one interrogative query, and one statement query. Note that you don’t need to use any punctuation, such as question marks, for the query classifier to make the right decision.

queries = [
    "Arya Stark father",  # Keyword Query
    "Who was the father of Arya Stark",  # Interrogative Query
    "Lord Eddard was the father of Arya Stark",  # Statement Query
]

Below, you can see what the classifier does with these queries: it correctly determines that “Arya Stark father” is a keyword query and sends it to branch 2. It also correctly classifies both the interrogative query “Who was the father of Arya Stark” and the statement query “Lord Eddard was the father of Arya Stark” as non-keyword queries, and sends them to branch 1.

import pandas as pd

k_vs_qs_results = {"Query": [], "Output Branch": [], "Class": []}

for query in queries:
    result = keyword_classifier.run(query=query)
    k_vs_qs_results["Query"].append(query)
    k_vs_qs_results["Output Branch"].append(result[1])
    k_vs_qs_results["Class"].append("Question/Statement" if result[1] == "output_1" else "Keyword")

pd.DataFrame.from_dict(k_vs_qs_results)
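The loop above reads the branch name from result[1], that is, the second element of the tuple returned by run(). The sketch below mimics that shape in plain Python. toy_keyword_classifier is a hypothetical rule-based stand-in, not the Haystack implementation: a small word list replaces the transformer model, and the dictionary in the first tuple position is only a placeholder payload.

```python
# Hypothetical stand-in for a keyword vs. question/statement classifier.
# A crude word-list heuristic replaces the model; this is NOT how Haystack decides.
QUESTION_WORDS = {"who", "what", "when", "where", "why", "how", "which"}
LINKING_VERBS = {"is", "are", "was", "were"}


def toy_keyword_classifier(query):
    """Return (payload, branch_name), with the branch name at index 1."""
    tokens = query.lower().rstrip("?.!").split()
    looks_like_sentence = any(
        token in QUESTION_WORDS or token in LINKING_VERBS for token in tokens
    )
    # output_1: question/statement, output_2: keywords (same convention as above)
    branch = "output_1" if looks_like_sentence else "output_2"
    return {"query": query}, branch


demo_queries = [
    "Arya Stark father",
    "Who was the father of Arya Stark",
    "Lord Eddard was the father of Arya Stark",
]
demo_branches = [toy_keyword_classifier(query)[1] for query in demo_queries]
print(demo_branches)
```

A real classifier makes this decision with a trained model rather than a word list, but downstream code only needs the branch name, which is why the tutorial's loop looks at index 1 and nothing else.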

Next, try out a question vs. statement TransformersQueryClassifier. For this task, you need to define a new query classifier. Note that this time you have to specify the model explicitly, because the default model of TransformersQueryClassifier performs keyword vs. question/statement classification.

question_classifier = TransformersQueryClassifier(model_name_or_path="shahrukhx01/question-vs-statement-classifier")

queries = [
    "Who was the father of Arya Stark",  # Interrogative Query
    "Lord Eddard was the father of Arya Stark",  # Statement Query
]

q_vs_s_results = {"Query": [], "Output Branch": [], "Class": []}

for query in queries:
    result = question_classifier.run(query=query)
    q_vs_s_results["Query"].append(query)
    q_vs_s_results["Output Branch"].append(result[1])
    q_vs_s_results["Class"].append("Question" if result[1] == "output_1" else "Statement")

pd.DataFrame.from_dict(q_vs_s_results)

And as you see, the question “Who was the father of Arya Stark” is sent to branch 1, while the statement “Lord Eddard was the father of Arya Stark” is sent to branch 2. This means you can have your pipeline treat statements and questions differently.

Pipeline with Keyword vs. Question/Statement Query Classifiers

Now you will create a question-answering (QA) pipeline with a keyword vs. question/statement query classifier.

1) Initialize the DocumentStore and Write Documents

You’ll start creating a pipeline by initializing a DocumentStore, which will store the Documents. As Documents, you’ll use pages from the Game of Thrones wiki.

Initialize InMemoryDocumentStore, fetch Documents and index them to the DocumentStore:

from haystack.document_stores import InMemoryDocumentStore
from haystack.utils import fetch_archive_from_http, convert_files_to_docs, clean_wiki_text

document_store = InMemoryDocumentStore(use_bm25=True)

doc_dir = "data/tutorial14"
s3_url = "https://s3.eu-central-1.amazonaws.com/deepset.ai-farm-qa/datasets/documents/wiki_gameofthrones_txt14.zip"
fetch_archive_from_http(url=s3_url, output_dir=doc_dir)

got_docs = convert_files_to_docs(dir_path=doc_dir, clean_func=clean_wiki_text, split_paragraphs=True)
document_store.write_documents(got_docs)

2) Initialize Retrievers, Reader and QueryClassifier

Your pipeline will be a simple Retriever-Reader QA pipeline, but the Retriever choice will depend on the type of query received: keyword queries will use a sparse BM25Retriever, while question/statement queries will use the more accurate but also more computationally expensive EmbeddingRetriever.

Now, initialize both Retrievers, Reader and QueryClassifier:

from haystack.nodes import BM25Retriever, EmbeddingRetriever, FARMReader, TransformersQueryClassifier

bm25_retriever = BM25Retriever(document_store=document_store)

embedding_retriever = EmbeddingRetriever(
    document_store=document_store, embedding_model="sentence-transformers/multi-qa-mpnet-base-dot-v1"
)
document_store.update_embeddings(embedding_retriever, update_existing_embeddings=False)

reader = FARMReader(model_name_or_path="deepset/roberta-base-squad2")
keyword_classifier = TransformersQueryClassifier()

3) Define the Pipeline

As promised, the question/statement branch output_1 from the query classifier is fed into the EmbeddingRetriever, while the keyword branch output_2 from the same classifier is fed into the BM25Retriever. Both of these retrievers are then fed into our reader. Our pipeline can thus be thought of as having something of a diamond shape: all queries are sent into the classifier, which splits those queries into two different retrievers, and those retrievers feed their outputs to the same reader.

from haystack.pipelines import Pipeline

transformer_keyword_classifier = Pipeline()
transformer_keyword_classifier.add_node(component=keyword_classifier, name="QueryClassifier", inputs=["Query"])
transformer_keyword_classifier.add_node(
    component=embedding_retriever, name="EmbeddingRetriever", inputs=["QueryClassifier.output_1"]
)
transformer_keyword_classifier.add_node(
    component=bm25_retriever, name="BM25Retriever", inputs=["QueryClassifier.output_2"]
)
transformer_keyword_classifier.add_node(
    component=reader, name="QAReader", inputs=["BM25Retriever", "EmbeddingRetriever"]
)

# To generate a visualization of the pipeline, uncomment the following:
# transformer_keyword_classifier.draw("transformer_keyword_classifier.png")
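The diamond shape can be sketched without Haystack. In the toy version below, every function is a hypothetical stand-in (the word-count rule in classify is arbitrary and only there to produce two branches); the point is that exactly one retriever runs per query and both feed the same reader.

```python
# Toy illustration of the diamond-shaped flow; all functions are hypothetical stand-ins.
def classify(query):
    # Arbitrary rule for the demo: more than three words counts as a sentence.
    return "output_1" if len(query.split()) > 3 else "output_2"


def dense_retriever(query):
    return {"retriever": "EmbeddingRetriever", "query": query}


def sparse_retriever(query):
    return {"retriever": "BM25Retriever", "query": query}


def reader(retrieved):
    # Both branches converge here, like the shared QAReader node.
    return {"answered_from": retrieved["retriever"], "query": retrieved["query"]}


# Branch name -> retriever, mirroring the inputs=[...] wiring above.
BRANCHES = {"output_1": dense_retriever, "output_2": sparse_retriever}


def run_diamond(query):
    branch = classify(query)
    retrieved = BRANCHES[branch](query)  # only the selected branch executes
    return reader(retrieved)


print(run_diamond("arya stark father"))
print(run_diamond("Who is the father of Arya Stark?"))
```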

4) Run the Pipeline

Below, you can see how this choice affects the branching structure: the keyword query “arya stark father” and the question query “Who is the father of Arya Stark?” generate noticeably different results, a distinction that is likely due to the use of different retrievers for keyword vs. question/statement queries.

from haystack.utils import print_answers

# Useful for framing headers
equal_line = "=" * 30

# Run only the dense retriever on the full sentence query
res_1 = transformer_keyword_classifier.run(query="Who is the father of Arya Stark?")
print(f"\n\n{equal_line}\nQUESTION QUERY RESULTS\n{equal_line}")
print_answers(res_1, details="minimum")
print("\n\n")

# Run only the sparse retriever on a keyword based query
res_2 = transformer_keyword_classifier.run(query="arya stark father")
print(f"\n\n{equal_line}\nKEYWORD QUERY RESULTS\n{equal_line}")
print_answers(res_2, details="minimum")

Above you saw a potential use for keyword vs. question/statement classification: you might choose to use a less resource-intensive retriever for keyword queries than for question/statement queries. But what about question vs. statement classification?

Pipeline with Question vs. Statement Query Classifier

To illustrate one potential use for question vs. statement classification, you will build a pipeline that looks as follows:

  1. The pipeline will start with a retriever that every query will go through.
  2. The pipeline will end with a reader that only question queries will go through.

In other words, your pipeline will be a retriever-only pipeline for statement queries (given the statement “Arya Stark was the daughter of a Lord”, all you will get back are the most relevant documents), but it will be a retriever-reader pipeline for question queries.

To make things more concrete, your pipeline will start with a Retriever, which is then fed into a QueryClassifier that is set to do question vs. statement classification. The QueryClassifier’s first branch, which handles question queries, will then be sent to the Reader, while the second branch will not be connected to any other nodes. As a result, the last node of the pipeline depends on the type of query: questions go all the way through the Reader, while statements only go through the Retriever.

Now, define the pipeline. Keep in mind that you don’t need to write Documents to the DocumentStore again as they are already indexed.

1) Define a new TransformersQueryClassifier

question_classifier = TransformersQueryClassifier(model_name_or_path="shahrukhx01/question-vs-statement-classifier")

2) Define the Pipeline

transformer_question_classifier = Pipeline()
transformer_question_classifier.add_node(component=embedding_retriever, name="EmbeddingRetriever", inputs=["Query"])
transformer_question_classifier.add_node(
    component=question_classifier, name="QueryClassifier", inputs=["EmbeddingRetriever"]
)
transformer_question_classifier.add_node(component=reader, name="QAReader", inputs=["QueryClassifier.output_1"])

# To generate a visualization of the pipeline, uncomment the following:
# transformer_question_classifier.draw("transformer_question_classifier.png")

3) Run the Pipeline

And here are the results of this pipeline: with a question query like “Who is the father of Arya Stark?”, you obtain answers from a Reader, and with a statement query like “Arya Stark was the daughter of a Lord”, you just obtain documents from a Retriever.

from haystack.utils import print_documents

# Useful for framing headers
equal_line = "=" * 30

# Run the retriever + reader on the question query
res_1 = transformer_question_classifier.run(query="Who is the father of Arya Stark?")
print(f"\n\n{equal_line}\nQUESTION QUERY RESULTS\n{equal_line}")
print_answers(res_1, details="minimum")
print("\n\n")

# Run only the retriever on the statement query
res_2 = transformer_question_classifier.run(query="Arya Stark was the daughter of a Lord.")
print(f"\n\n{equal_line}\nSTATEMENT QUERY RESULTS\n{equal_line}")
print_documents(res_2)
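Because the last node depends on the query type, calling code may not know in advance whether a result holds answers or only documents. The helper below is a minimal sketch of one way to check. It assumes that a result is a dictionary with an "answers" key when the Reader ran and a "documents" key otherwise; summarize_result and the two mock dictionaries are hypothetical and only imitate that shape.

```python
def summarize_result(result):
    """Describe a pipeline result dictionary (key names are assumptions)."""
    if "answers" in result:
        return f"reader ran: {len(result['answers'])} answer(s)"
    if "documents" in result:
        return f"retriever only: {len(result['documents'])} document(s)"
    return "no answers or documents found"


# Mock results standing in for real pipeline output.
question_like = {"answers": ["Eddard"], "documents": ["doc 1", "doc 2"]}
statement_like = {"documents": ["doc 1", "doc 2"]}

print(summarize_result(question_like))
print(summarize_result(statement_like))
```

With a check like this, one code path could decide between print_answers and print_documents instead of hard-coding the choice per query.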

Custom Use Cases for Query Classifiers

TransformersQueryClassifier is very flexible and also supports other options for classifying queries. For example, you may be interested in detecting the sentiment of the query or classifying the topics. You can do this by loading a custom classification model from the Hugging Face Hub or by using zero-shot classification.

Custom classification model vs zero-shot classification

  • Traditional text classification models are trained to predict one of a few “hard-coded” classes and require a dedicated training dataset. In the Hugging Face Hub, you can find many pre-trained models, maybe even related to your domain of interest.
  • Zero-shot classification is very versatile: by choosing a suitable base transformer, you can classify the text without any training dataset. You just have to provide the candidate categories.

Using custom classification models

For this use case, you can use a public model available in the Hugging Face Hub. For example, if you want to classify the sentiment of the queries, you can choose an appropriate model, such as cardiffnlp/twitter-roberta-base-sentiment.

In this case, the labels parameter must contain a list with the exact model labels. The first label corresponds to output_1, the second label to output_2, and so on. For cardiffnlp/twitter-roberta-base-sentiment, the labels are LABEL_0 (negative), LABEL_1 (neutral), and LABEL_2 (positive).

labels = ["LABEL_0", "LABEL_1", "LABEL_2"]

sentiment_query_classifier = TransformersQueryClassifier(
    model_name_or_path="cardiffnlp/twitter-roberta-base-sentiment",
    use_gpu=True,
    task="text-classification",
    labels=labels,
)

queries = [
    "What's the answer?",  # neutral query
    "Would you be so lovely to tell me the answer?",  # positive query
    "Can you give me the damn right answer for once??",  # negative query
]

import pandas as pd

sent_results = {"Query": [], "Output Branch": [], "Class": []}

for query in queries:
    result = sentiment_query_classifier.run(query=query)
    sent_results["Query"].append(query)
    sent_results["Output Branch"].append(result[1])
    if result[1] == "output_1":
        sent_results["Class"].append("negative")
    elif result[1] == "output_2":
        sent_results["Class"].append("neutral")
    elif result[1] == "output_3":
        sent_results["Class"].append("positive")

pd.DataFrame.from_dict(sent_results)
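The if/elif chain above follows directly from the ordering rule: the position of a label in the list determines its output branch. As a small sketch in plain Python, the same mapping can be built once and reused. The names readable_names, branch_to_class, and branch_to_label are illustrative, not part of Haystack.

```python
labels = ["LABEL_0", "LABEL_1", "LABEL_2"]
# Human-readable names in the same order as the model labels.
readable_names = ["negative", "neutral", "positive"]

# Branch numbering starts at 1: the first label maps to output_1, and so on.
branch_to_label = {f"output_{i}": label for i, label in enumerate(labels, start=1)}
branch_to_class = {f"output_{i}": name for i, name in enumerate(readable_names, start=1)}

print(branch_to_label)
print(branch_to_class)
```

Inside the loop, branch_to_class[result[1]] would then replace the three-way if/elif, and the lookup stays correct if more labels are added, as long as both lists keep the same order.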

Using zero-shot classification

You can also perform zero-shot classification by providing a suitable base transformer model and defining the classes the model should predict.

For example, you may be interested in whether the user query is related to music or cinema. In this case, the labels parameter is a list containing the candidate classes.

labels = ["music", "cinema"]

query_classifier = TransformersQueryClassifier(
    model_name_or_path="typeform/distilbert-base-uncased-mnli",
    use_gpu=True,
    task="zero-shot-classification",
    labels=labels,
)

queries = [
    "In which films does John Travolta appear?",  # cinema
    "What is the Rolling Stones first album?",  # music
    "Who was Sergio Leone?",  # cinema
]

import pandas as pd

query_classification_results = {"Query": [], "Output Branch": [], "Class": []}

for query in queries:
    result = query_classifier.run(query=query)
    query_classification_results["Query"].append(query)
    query_classification_results["Output Branch"].append(result[1])
    query_classification_results["Class"].append("music" if result[1] == "output_1" else "cinema")

pd.DataFrame.from_dict(query_classification_results)

Congratulations! 🎉 You’ve learned how TransformersQueryClassifier works and how you can use it in a pipeline.