Speaker Diarization with AssemblyAI


📚 This cookbook has an accompanying article with a complete walkthrough: “Level up Your RAG Application with Speaker Diarization”.

LLMs excel at working with text data and can answer complex questions about it without manual reading or searching. To work with audio or video, the key is to provide the LLM with a transcription. A transcription captures the spoken content of the recording, but in multi-speaker recordings it misses non-verbal information and does not convey how many speakers there are or which remark belongs to whom. Therefore, to maximize the LLM’s potential with such recordings, Speaker Diarization is essential!

In this example, we’ll build a RAG application with speaker labels for audio files. This application will use Haystack and speaker diarization models by AssemblyAI.

📚 Useful Sources:

Install the Dependencies

%%bash
pip install haystack-ai
pip install assemblyai-haystack
pip install "sentence-transformers>=3.0.0"
pip install "huggingface_hub>=0.23.0"
pip install --upgrade gdown

Download The Audio Files

We extracted the audio from YouTube videos and saved the files in a Google Drive folder for you: https://drive.google.com/drive/folders/10zsFuHmj3oytYMyGrLdytpW-6JzT9T_W?usp=drive_link

You can run the code below to download the audio files to this Colab notebook; they will appear under the “Files” tab in the left bar.

!gdown https://drive.google.com/drive/folders/10zsFuHmj3oytYMyGrLdytpW-6JzT9T_W -O "/content" --folder
Retrieving folder contents
Processing file 12654ySXSYc2rZnPgNxXZwWt2hH-kTNDZ Netflix_Q4_2023_Earnings_Interview.mp3
Processing file 1Zb15D_nrBzWlM3K8FuPOmyiCiYvsuJLD Panel_Discussion.mp3
Processing file 1FFKGEZAUSmJayZgGaAe1uFP9HUtOK5m- Working_From_Home_Debate.mp3
Retrieving folder contents completed
Building directory structure
Building directory structure completed
Downloading...
From: https://drive.google.com/uc?id=12654ySXSYc2rZnPgNxXZwWt2hH-kTNDZ
To: /content/Netflix_Q4_2023_Earnings_Interview.mp3
100% 39.1M/39.1M [00:00<00:00, 67.6MB/s]
Downloading...
From: https://drive.google.com/uc?id=1Zb15D_nrBzWlM3K8FuPOmyiCiYvsuJLD
To: /content/Panel_Discussion.mp3
100% 21.8M/21.8M [00:00<00:00, 60.4MB/s]
Downloading...
From: https://drive.google.com/uc?id=1FFKGEZAUSmJayZgGaAe1uFP9HUtOK5m-
To: /content/Working_From_Home_Debate.mp3
100% 4.45M/4.45M [00:00<00:00, 34.8MB/s]
Download completed
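
If you want to confirm the download worked before moving on, a minimal check like the one below (using the file names from the gdown output above) prints whether each file is present in /content:

import os

# Optional sanity check: confirm the three audio files from the gdown output above
# were downloaded to /content.
for name in [
    "Netflix_Q4_2023_Earnings_Interview.mp3",
    "Panel_Discussion.mp3",
    "Working_From_Home_Debate.mp3",
]:
    path = os.path.join("/content", name)
    print(path, "exists" if os.path.exists(path) else "missing")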

Add Your API Keys

Enter your API keys for AssemblyAI and Hugging Face:

import os
from getpass import getpass

ASSEMBLYAI_API_KEY = getpass("Enter your ASSEMBLYAI_API_KEY: ")
os.environ["HF_API_TOKEN"] = getpass("HF_API_TOKEN: ")
Enter your ASSEMBLYAI_API_KEY: ··········
HF_API_TOKEN: ··········

Index Speaker Labels to Your DocumentStore

Build a pipeline to generate speaker labels and index them into a DocumentStore with their embeddings. In this pipeline, you need an AssemblyAITranscriber to transcribe the audio with speaker labels, a DocumentSplitter to chunk the transcript, a SentenceTransformersDocumentEmbedder to compute embeddings, and a DocumentWriter to write the resulting documents into an InMemoryDocumentStore.

Note: The speaker information will be saved in the meta of the Document object.

from haystack.components.writers import DocumentWriter
from haystack.components.preprocessors import DocumentSplitter
from haystack.components.embedders import SentenceTransformersDocumentEmbedder
from haystack import Pipeline
from haystack.document_stores.in_memory import InMemoryDocumentStore
from assemblyai_haystack.transcriber import AssemblyAITranscriber
from haystack.document_stores.types import DuplicatePolicy
from haystack.utils import ComponentDevice

speaker_document_store = InMemoryDocumentStore()
transcriber = AssemblyAITranscriber(api_key=ASSEMBLYAI_API_KEY)
speaker_splitter = DocumentSplitter(
    split_by = "sentence",
    split_length = 10,
    split_overlap = 1
)
speaker_embedder = SentenceTransformersDocumentEmbedder(device=ComponentDevice.from_str("cuda:0"))
speaker_writer = DocumentWriter(speaker_document_store, policy=DuplicatePolicy.SKIP)

indexing_pipeline = Pipeline()
indexing_pipeline.add_component(instance=transcriber, name="transcriber")
indexing_pipeline.add_component(instance=speaker_splitter, name="speaker_splitter")
indexing_pipeline.add_component(instance=speaker_embedder, name="speaker_embedder")
indexing_pipeline.add_component(instance=speaker_writer, name="speaker_writer")

indexing_pipeline.connect("transcriber.speaker_labels", "speaker_splitter")
indexing_pipeline.connect("speaker_splitter", "speaker_embedder")
indexing_pipeline.connect("speaker_embedder", "speaker_writer")

Provide an audio_file_path and run your pipeline:

audio_file_path = "/content/Panel_Discussion.mp3" #@param ["/content/Netflix_Q4_2023_Earnings_Interview.mp3", "/content/Working_From_Home_Debate.mp3", "/content/Panel_Discussion.mp3"]
indexing_pipeline.run(
    {
        "transcriber": {
            "file_path": audio_file_path,
            "summarization": None,
            "speaker_labels": True
        },
    }
)
{'transcriber': {'transcription': [Document(id=427e56c68f0440dd8f51643ba52e2a2b60c739f4fc42ddab7207fb428da4492d, content: 'I want to start with you, Amy, because I know you, obviously at Shell have had AI as part of the wor...', meta: {'transcript_id': 'c053a806-6826-40ac-a6bc-95cab9b4cb8a', 'audio_url': 'https://cdn.assemblyai.com/upload/188cdd14-ff33-4468-81cb-e2c337674fc5'})]},
 'speaker_writer': {'documents_written': 64}}
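
As a quick way to see the speaker metadata mentioned above, you can peek at a few of the indexed chunks. This is only an optional sketch; it assumes the indexing run above has finished and uses InMemoryDocumentStore.filter_documents(), which returns all stored documents when called without filters:

# Optional: inspect a few indexed chunks and the speaker label stored in their meta.
for doc in speaker_document_store.filter_documents()[:3]:
    print(doc.meta.get("speaker"), "->", doc.content[:80])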

RAG Pipeline with Speaker Labels

Build a RAG pipeline to generate answers to questions about the recording. Ensure that the speaker information (provided through the metadata of each document) is included in the prompt so that the LLM can distinguish who said what. For this pipeline, you need a SentenceTransformersTextEmbedder to embed the question, an InMemoryEmbeddingRetriever to fetch the relevant transcript chunks, a PromptBuilder to inject the speaker labels into the prompt, and a HuggingFaceAPIGenerator as the LLM.

The LLM in this example (mistralai/Mixtral-8x7B-Instruct-v0.1) is a gated model. Make sure you have access to it before running the pipeline.
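
If you are unsure whether your token can access the gated repo, a small check like the one below can save a failed pipeline run. This is only a sketch using standard huggingface_hub functions (login and model_info); the model id matches the generator configured further down:

import os
from huggingface_hub import login, model_info

# Authenticate with the Hugging Face token entered earlier, then query the gated repo.
# model_info raises an error if the token does not have access to the model.
login(token=os.environ["HF_API_TOKEN"])
print(model_info("mistralai/Mixtral-8x7B-Instruct-v0.1").id)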

from haystack import Pipeline
from haystack.components.builders.prompt_builder import PromptBuilder
from haystack.components.generators import HuggingFaceAPIGenerator
from haystack.components.embedders import SentenceTransformersTextEmbedder
from haystack.components.retrievers.in_memory import InMemoryEmbeddingRetriever
from haystack.utils import ComponentDevice

prompt = """
You will be provided with a transcription of a recording with each sentence or group of sentences attributed to a Speaker by the word "Speaker" followed by a letter representing the person uttering that sentence. Answer the given question based on the given context.
If you think that the given transcription is not enough to answer the question, say so.

Transcription:
{% for doc in documents %}
  {% if doc.meta["speaker"] %} Speaker {{doc.meta["speaker"]}}: {% endif %}{{doc.content}}
{% endfor %}
Question: {{ question }}
<|end_of_turn|>
Answer:
"""

retriever = InMemoryEmbeddingRetriever(speaker_document_store)
text_embedder = SentenceTransformersTextEmbedder(device=ComponentDevice.from_str("cuda:0"))
answer_generator = HuggingFaceAPIGenerator(
    api_type="serverless_inference_api",
    api_params={"model": "mistralai/Mixtral-8x7B-Instruct-v0.1"},
    generation_kwargs={"max_new_tokens":500})
prompt_builder = PromptBuilder(template=prompt)

speaker_rag_pipe = Pipeline()
speaker_rag_pipe.add_component("text_embedder", text_embedder)
speaker_rag_pipe.add_component("retriever", retriever)
speaker_rag_pipe.add_component("prompt_builder", prompt_builder)
speaker_rag_pipe.add_component("llm", answer_generator)

speaker_rag_pipe.connect("text_embedder.embedding", "retriever.query_embedding")
speaker_rag_pipe.connect("retriever.documents", "prompt_builder.documents")
speaker_rag_pipe.connect("prompt_builder.prompt", "llm.prompt")

Test RAG with Speaker Labels

question = "What are each speakers' opinions on building in-house or using third parties?" # @param ["What are the two opposing opinions and how many people are on each side?", "What are each speakers' opinions on building in-house or using third parties?", "How many people are speaking in this recording?" ,"How many speakers and moderators are in this call?"]

result = speaker_rag_pipe.run({
    "prompt_builder":{"question": question},
    "text_embedder":{"text": question},
    "retriever":{"top_k": 10}
})
result["llm"]["replies"][0]
" Speaker A is interested in understanding how companies decide between building in-house solutions or using third parties. Speaker B believes that the decision depends on whether the task is part of the company's core IP or not. They also mention that the build versus buy decision is too simplistic, as there are other options like partnering or using third-party platforms. Speaker C takes a mixed approach, using open source and partnering, and emphasizes the importance of embedding AI into the business. Speaker B thinks that AI is not magic and requires hard work, process, and change management, just like any other business process."