Maintained by deepset

Integration: Unstructured File Converter

Component to easily convert files and directories into Documents using the Unstructured API

Authors
deepset

Overview

Component for the Haystack (2.x) LLM framework to convert files and directories into Documents using the Unstructured API.

Unstructured provides ETL tools for LLMs, extracting text and other information from various file formats. See supported file types for more details.

Installation

To install the Unstructured File Converter, run:

pip install unstructured-fileconverter-haystack

Usage

Connecting to the Unstructured API

Hosted API

The hosted Unstructured API comes in two versions: the Free Unstructured API and the paid Unstructured Serverless API.

For the Free Unstructured API, the API URL is https://api.unstructured.io/general/v0/general. For the Unstructured Serverless API, find your unique API URL in your Unstructured account.

Note that the API keys for free and paid versions are not interchangeable.

Set the Unstructured API key as an environment variable:

export UNSTRUCTURED_API_KEY=your_api_key
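Since the key comes from the environment, it can help to fail early with a clear message when it is missing. The following is a minimal sketch using only the standard library; the helper name `require_api_key` is hypothetical and not part of the integration.

```python
import os


def require_api_key(env_var: str = "UNSTRUCTURED_API_KEY") -> str:
    """Return the API key stored in the environment, or raise a clear error.

    Hypothetical helper for illustration; the integration does not ship it.
    """
    key = os.environ.get(env_var, "").strip()
    if not key:
        raise RuntimeError(
            f"{env_var} is not set; export it before running the converter."
        )
    return key
```

Calling `require_api_key()` at the start of a script surfaces a missing key before any file is sent to the API.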

Local API (Docker)

You can run a local instance of the Unstructured API using Docker:

docker run -p 8000:8000 -d --rm --name unstructured-api quay.io/unstructured-io/unstructured-api:latest --port 8000 --host 0.0.0.0

When initializing the component, specify the localhost URL:

from haystack_integrations.components.converters.unstructured import UnstructuredFileConverter

converter = UnstructuredFileConverter(api_url="http://localhost:8000/general/v0/general")
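The three deployment options above differ only in the URL passed as `api_url`. As a rough sketch, the choice could be kept in one place with a small helper; the function name `resolve_api_url` and the mode names `"free"`, `"serverless"`, and `"local"` are assumptions made for this example, not part of the integration.

```python
from typing import Optional

# URLs taken from this page; the serverless URL is account-specific.
FREE_API_URL = "https://api.unstructured.io/general/v0/general"
LOCAL_API_URL = "http://localhost:8000/general/v0/general"


def resolve_api_url(mode: str, serverless_url: Optional[str] = None) -> str:
    """Map a deployment mode to the API URL to pass to the converter.

    Hypothetical helper: the mode names are illustrative only.
    """
    if mode == "free":
        return FREE_API_URL
    if mode == "local":
        return LOCAL_API_URL
    if mode == "serverless":
        if not serverless_url:
            raise ValueError("A serverless deployment needs your account's API URL.")
        return serverless_url
    raise ValueError(f"Unknown mode: {mode!r}")
```

The result can then be passed on, for example `UnstructuredFileConverter(api_url=resolve_api_url("local"))`.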

Running Unstructured File Converter

In isolation

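from haystack_integrations.components.converters.unstructured import UnstructuredFileConverter

# The API key is expected in the UNSTRUCTURED_API_KEY environment variable.
converter = UnstructuredFileConverter()

# `paths` accepts both individual files and directories.
documents = converter.run(paths=["a/file/path.pdf", "a/directory/path"])["documents"]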
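To get a quick look at what the converter returned, the documents can be summarized one per line. This sketch assumes each returned object exposes `content` and `meta` attributes, as Haystack 2.x `Document` objects do. The `file_path` metadata key and the helper name `summarize` are assumptions for illustration.

```python
from typing import Any, Iterable, List


def summarize(documents: Iterable[Any], preview_chars: int = 80) -> List[str]:
    """Build one short preview line per document.

    Assumes each item has `.content` (str or None) and `.meta` (dict or None).
    """
    lines = []
    for index, doc in enumerate(documents):
        # Collapse newlines so each preview stays on a single line.
        text = (doc.content or "").replace("\n", " ")
        # "file_path" is an assumed metadata key; fall back if it is absent.
        source = (doc.meta or {}).get("file_path", "unknown")
        lines.append(f"[{index}] {source}: {text[:preview_chars]}")
    return lines
```

For example, `print("\n".join(summarize(documents)))` prints a compact listing of the converted documents.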

In a Haystack Pipeline

from haystack import Pipeline
from haystack.components.writers import DocumentWriter
from haystack.document_stores.in_memory import InMemoryDocumentStore
from haystack_integrations.components.converters.unstructured import UnstructuredFileConverter

document_store = InMemoryDocumentStore()

indexing = Pipeline()
indexing.add_component("converter", UnstructuredFileConverter())
indexing.add_component("writer", DocumentWriter(document_store))
indexing.connect("converter", "writer")

indexing.run({"converter": {"paths": ["a/file/path.pdf", "a/directory/path"]}})
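Because the pipeline input mixes file and directory paths, a typo in one of them is easy to miss. One option is to check the paths before running the pipeline, as in this standard-library sketch; `missing_paths` is a hypothetical helper, not part of Haystack or the integration.

```python
from pathlib import Path
from typing import Iterable, List


def missing_paths(paths: Iterable[str]) -> List[str]:
    """Return the entries that exist neither as a file nor as a directory."""
    return [p for p in paths if not Path(p).exists()]
```

If `missing_paths(["a/file/path.pdf", "a/directory/path"])` returns a non-empty list, those entries can be fixed or dropped before calling `indexing.run(...)`.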