Integration: Apify

Extract data from the web and automate web tasks using the Apify-Haystack integration.

Authors

apify

GitHub Repo PyPI Package

Overview
Installation
Usage
- ApifyDatasetFromActorCall on its own
- ApifyDatasetFromActorCall in a RAG pipeline
License

Overview

Apify is a web scraping and data extraction platform. It helps automate web tasks and extract content from e-commerce websites, social media (Facebook, Instagram, TikTok), search engines, online maps, and more. Apify provides more than two thousand ready-made cloud solutions called Actors.

Installation

Install the Apify-Haystack integration:

pip install apify-haystack

Usage

Once the integration is installed, you can use more than two thousand ready-made apps, called Actors, from Apify Store. Example use cases:

Load a dataset from Apify and convert it to a Haystack Document
Extract data from Facebook/Instagram and save it in the InMemoryDocumentStore
Crawl websites, scrape text content, and store it in the InMemoryDocumentStore
Retrieval-Augmented Generation (RAG): Extracting text from a website & question answering

The integration implements the following components (you can find their usage in these examples):

ApifyDatasetLoader: Load a dataset created by an Apify Actor
ApifyDatasetFromActorCall: Call an Apify Actor, load the dataset, and convert it to Haystack Documents
ApifyDatasetFromTaskCall: Call an Apify task, load the dataset, and convert it to Haystack Documents

You need an Apify account and an Apify API token to run the examples below. You can start with a free account at Apify and get your Apify API token.

In the examples below, set your token in the APIFY_API_TOKEN environment variable and run the script.
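
One way to avoid hard-coding the token is to read it from the environment and fail early with a clear message when it is missing. The snippet below is a minimal sketch under that assumption; `require_env` is a hypothetical helper written for illustration and is not part of apify-haystack.

```python
import os


def require_env(name: str) -> str:
    """Return the value of an environment variable or raise a clear error.

    Hypothetical helper for illustration; not part of apify-haystack.
    """
    value = os.environ.get(name, "").strip()
    if not value:
        raise RuntimeError(f"Environment variable {name} is not set")
    return value


# Assumes the token was exported in the shell or loaded from a .env file.
try:
    require_env("APIFY_API_TOKEN")
    print("APIFY_API_TOKEN is set")
except RuntimeError as err:
    print(err)
```

Failing before the Actor is called keeps a missing token from surfacing later as a less obvious authentication error.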

ApifyDatasetFromActorCall on its own

Use Apify’s Website Content Crawler to crawl a website, scrape its text content, and convert it to Haystack Documents. You can browse other Actors in Apify Store.

In the example below, the text content is extracted from https://haystack.deepset.ai/. You can control the number of crawled pages with the maxCrawlPages parameter. For a detailed overview of the parameters, see Website Content Crawler.

The script below should produce output like the following (truncated to a single Document):

Document(id=a617d376*****, content: 'Introduction to Haystack 2.x
Haystack is an open-source framework fo...', meta: {'url': 'https://docs.haystack.deepset.ai/docs/intro'})

from dotenv import load_dotenv
import os
from haystack import Document

from apify_haystack import ApifyDatasetFromActorCall

# Load APIFY_API_TOKEN from a .env file, or set it explicitly below
# (the assignment overrides any value loaded from .env)
load_dotenv()
os.environ["APIFY_API_TOKEN"] = "YOUR APIFY_API_TOKEN"

actor_id = "apify/website-content-crawler"
run_input = {
    "maxCrawlPages": 3,  # limit the number of pages to crawl
    "startUrls": [{"url": "https://haystack.deepset.ai/"}],
}


def dataset_mapping_function(dataset_item: dict) -> Document:
    """Convert an Apify dataset item to a Haystack Document

    Website Content Crawler returns a dataset with the following output fields:
    {
        "url": "https://haystack.deepset.ai",
        "text": "Haystack is an open-source framework for building production-ready LLM applications",
    }
    """
    return Document(content=dataset_item.get("text"), meta={"url": dataset_item.get("url")})


actor = ApifyDatasetFromActorCall(
    actor_id=actor_id,
    run_input=run_input,
    dataset_mapping_function=dataset_mapping_function
)
print(f"Calling the Apify Actor {actor_id} ... crawling will take some time ...")
print("You can monitor the progress at: https://console.apify.com/actors/runs")

dataset = actor.run().get("documents")

print(f"Loaded {len(dataset)} documents from the Apify Actor {actor_id}:")
for d in dataset:
    print(d)
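
To see what the mapping step does without calling Apify, the sketch below applies a mapping function to hand-written sample items. It is an illustration under stated assumptions: `SimpleDocument` is a hypothetical stand-in for the Haystack `Document` so that the snippet runs with the standard library only, the sample items are invented, and skipping items without text is a design choice of this sketch rather than behavior of the integration.

```python
from dataclasses import dataclass, field


@dataclass
class SimpleDocument:
    """Hypothetical stand-in for haystack.Document (content plus metadata)."""

    content: str
    meta: dict = field(default_factory=dict)


def map_item(dataset_item: dict) -> SimpleDocument:
    """Mirror dataset_mapping_function: keep the text, store the URL as metadata."""
    return SimpleDocument(
        content=dataset_item.get("text") or "",
        meta={"url": dataset_item.get("url")},
    )


# Invented sample items shaped like the "url"/"text" fields described above.
sample_items = [
    {"url": "https://haystack.deepset.ai", "text": "Haystack is an open-source framework"},
    {"url": "https://haystack.deepset.ai/empty", "text": ""},
    {"url": "https://haystack.deepset.ai/missing"},
]

# Drop items that carry no text so that empty documents are never indexed.
documents = [doc for doc in map(map_item, sample_items) if doc.content]

for doc in documents:
    print(doc.meta["url"], "->", doc.content)
```

Using `.get()` in the mapping function, as the script above does, means a dataset item with a missing field yields an empty value instead of raising a `KeyError`.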

ApifyDatasetFromActorCall in a RAG pipeline

Follow 🧑‍🍳 Cookbook: Extract and use website content for question answering with Apify-Haystack integration for the full runnable example.

This example uses Retrieval-Augmented Generation (RAG): it extracts text content from the https://haystack.deepset.ai website and uses that content to answer questions about the site.

Expected output:

question: "What is haystack?"
answer: Haystack is an open-source framework for building production-ready LLM applications

In addition to the APIFY_API_TOKEN, you also need to specify OPENAI_API_KEY to run this example.


import os
from getpass import getpass

from dotenv import load_dotenv
from haystack import Document, Pipeline
from haystack.components.builders import PromptBuilder
from haystack.components.embedders import OpenAIDocumentEmbedder, OpenAITextEmbedder
from haystack.components.generators import OpenAIGenerator
from haystack.components.retrievers.in_memory import InMemoryEmbeddingRetriever
from haystack.document_stores.in_memory import InMemoryDocumentStore
from haystack.utils.auth import Secret

from apify_haystack import ApifyDatasetFromActorCall

# Read APIFY_API_TOKEN and OPENAI_API_KEY from a .env file, or prompt for any that are missing
load_dotenv()
for key in ("APIFY_API_TOKEN", "OPENAI_API_KEY"):
    if not os.environ.get(key):
        os.environ[key] = getpass(f"Enter YOUR {key}: ")

actor_id = "apify/website-content-crawler"
run_input = {
    "maxCrawlPages": 1,  # limit the number of pages to crawl
    "startUrls": [{"url": "https://haystack.deepset.ai/"}],
}


def dataset_mapping_function(dataset_item: dict) -> Document:
    """Convert an Apify dataset item to a Haystack Document

    Website Content Crawler returns a dataset with the following output fields:
    {
        "url": "https://haystack.deepset.ai",
        "text": "Haystack is an open-source framework for building production-ready LLM applications",
    }
    """
    return Document(content=dataset_item.get("text"), meta={"url": dataset_item.get("url")})


apify_dataset_loader = ApifyDatasetFromActorCall(
    actor_id=actor_id,
    run_input=run_input,
    dataset_mapping_function=dataset_mapping_function
)

# Components
print("Initializing components...")
document_store = InMemoryDocumentStore()

docs_embedder = OpenAIDocumentEmbedder()
text_embedder = OpenAITextEmbedder()
retriever = InMemoryEmbeddingRetriever(document_store)
generator = OpenAIGenerator(model="gpt-3.5-turbo")

# Load documents from Apify
print("Crawling and indexing documents...")
print("You can visit https://console.apify.com/actors/runs to monitor the progress")
docs = apify_dataset_loader.run()
embeddings = docs_embedder.run(docs.get("documents"))
document_store.write_documents(embeddings["documents"])

template = """
Given the following information, answer the question.

Context:
{% for document in documents %}
    {{ document.content }}
{% endfor %}

Question: {{question}}
Answer:
"""

prompt_builder = PromptBuilder(template=template)

# Add components to your pipeline
print("Initializing pipeline...")
pipe = Pipeline()
pipe.add_component("embedder", text_embedder)
pipe.add_component("retriever", retriever)
pipe.add_component("prompt_builder", prompt_builder)
pipe.add_component("llm", generator)

# Now, connect the components to each other
pipe.connect("embedder.embedding", "retriever.query_embedding")
pipe.connect("retriever", "prompt_builder.documents")
pipe.connect("prompt_builder", "llm")

question = "What is haystack?"

print("Running pipeline ... ")
response = pipe.run({"embedder": {"text": question}, "prompt_builder": {"question": question}})

print(f"question: {question}")
print(f"answer: {response['llm']['replies'][0]}")

# Other questions
examples = [
    "Who created Haystack?",
    "Are there any upcoming events or community talks?",
]

for example in examples:
    response = pipe.run({"embedder": {"text": example}, "prompt_builder": {"question": example}})
    print(f"question: {example}")
    print(f"answer: {response['llm']['replies'][0]}")
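
The retrieval step in the pipeline above selects the stored documents whose embeddings are closest to the question embedding. The toy sketch below illustrates that idea with cosine similarity over hand-made three-dimensional vectors. The vectors, the texts, and the `top_k` helper are invented for illustration and do not reproduce how `InMemoryEmbeddingRetriever` is implemented.

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two vectors; 0.0 if either vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def top_k(query: list[float], store: list[tuple[str, list[float]]], k: int = 1) -> list[str]:
    """Return the texts of the k stored entries most similar to the query vector."""
    ranked = sorted(store, key=lambda item: cosine_similarity(query, item[1]), reverse=True)
    return [text for text, _ in ranked[:k]]


# Invented (text, embedding) pairs; real embeddings have far more dimensions.
store = [
    ("Haystack is an open-source framework", [0.9, 0.1, 0.0]),
    ("Apify runs web scraping Actors", [0.1, 0.9, 0.0]),
    ("Unrelated page footer", [0.0, 0.1, 0.9]),
]
query_embedding = [1.0, 0.2, 0.0]

print(top_k(query_embedding, store, k=2))
```

The texts returned by the retriever are what the prompt template places under "Context:" before the question is sent to the generator.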

License

apify-haystack is distributed under the terms of the Apache-2.0 license.