
Overview

This service is optional. document-api and its dependency opensearch are only needed for knowledge base / RAG features. Voice assistants work fully without it. Use make up-all-with-knowledge (Docker) to start with knowledge base support.
The document-api is the knowledge backend for the Rapida platform. It processes documents from PDF, Word, CSV, and other formats into searchable vector embeddings and full-text search indices. At call time, assistant-api queries this service to inject relevant knowledge context into the LLM prompt.

Port

9010 — HTTP (FastAPI / uvicorn)

Language

Python 3.11+ · FastAPI + Celery

Storage

PostgreSQL (assistant_db) · Redis (Celery broker) · OpenSearch (vectors + full-text)

Document processing is asynchronous. When a document is uploaded, the API immediately returns a document_id with status: processing. Text extraction, chunking, and embedding generation are handled by Celery workers in the background.
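Clients should therefore poll until processing completes. Below is a minimal polling helper; the exact status endpoint is not documented here, so `fetch_status` stands in for whatever call retrieves the current document status:

```python
import time

def wait_until_processed(fetch_status, timeout: float = 300.0, interval: float = 2.0) -> str:
    """Poll fetch_status() until the document leaves 'processing' or timeout expires.

    fetch_status is any callable returning the current status string,
    e.g. a wrapper around a GET on the document record.
    """
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        status = fetch_status()
        if status != "processing":
            return status  # e.g. "completed" or "failed"
        time.sleep(interval)
    raise TimeoutError(f"document still processing after {timeout:.0f} s")
```

Injecting `fetch_status` as a callable keeps the helper independent of any HTTP client and easy to test.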

Components

The processing pipeline runs as a Celery task after each upload. The stages are sequential and the document status is updated at each step.
| Stage | Library | Configurable |
|---|---|---|
| Text extraction | format-specific (see table below) | No |
| Chunking | custom splitter | CHUNK_SIZE, CHUNK_OVERLAP |
| Embeddings | sentence-transformers | EMBEDDINGS_MODEL |
| Full-text index | OpenSearch | — |
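The chunking stage can be approximated as a sliding window over the extracted text. A sketch assuming CHUNK_SIZE and CHUNK_OVERLAP are character counts (the actual splitter may additionally honor sentence or paragraph boundaries):

```python
def chunk_text(text: str, chunk_size: int = 1000, overlap: int = 100) -> list[str]:
    """Split text into chunks of chunk_size characters, with `overlap`
    characters shared between adjacent chunks."""
    if overlap >= chunk_size:
        raise ValueError("overlap must be smaller than chunk_size")
    chunks = []
    step = chunk_size - overlap  # how far the window advances each iteration
    for start in range(0, len(text), step):
        chunk = text[start:start + chunk_size]
        if chunk:
            chunks.append(chunk)
        if start + chunk_size >= len(text):
            break  # last window already covered the end of the text
    return chunks
```

With the defaults, a 2,500-character document yields three chunks, and each chunk repeats the last 100 characters of its predecessor.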
Supported input formats:

| Format | Library | What is extracted |
|---|---|---|
| PDF | PyPDF2, pdfplumber | Text content + metadata |
| Word (.docx) | python-docx | Text + paragraph structure |
| Excel (.xlsx) | openpyxl, pandas | Cell values as text |
| CSV | pandas | Row data as text |
| Markdown (.md) | built-in | Text preserving structure |
| HTML | BeautifulSoup | Cleaned text from HTML |
| Plain text (.txt) | built-in | Direct read |
| Images | pytesseract (OCR) | OCR-extracted text |
Embeddings are generated using sentence-transformers. The model is configurable:
| Model | Dimensions | Speed | Quality | Notes |
|---|---|---|---|---|
| all-MiniLM-L6-v2 | 384 | Fast | Good | Default, ~80 MB |
| all-mpnet-base-v2 | 768 | Medium | High | Larger model |
| all-MiniLM-L12-v2 | 384 | Medium | Good | 12-layer variant, slightly better quality than L6 |
| multilingual-e5-base | 768 | Medium | Good | 100+ languages |
Set via EMBEDDINGS_MODEL in the config.
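These vectors are what the search step compares: conceptually, retrieval is cosine similarity plus top-k selection. A pure-Python sketch of that idea (the real service delegates this to OpenSearch's vector search):

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0

def top_k(query_vec, chunks, k=5, threshold=0.5):
    """chunks: list of (chunk_id, vector) pairs.
    Returns [(chunk_id, score)] above threshold, highest score first."""
    scored = [(cid, cosine(query_vec, vec)) for cid, vec in chunks]
    scored = [(cid, s) for cid, s in scored if s >= threshold]
    return sorted(scored, key=lambda t: t[1], reverse=True)[:k]
```

The `threshold` and `top_k` parameters here mirror the fields of the search request shown below.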
The document-api includes RNNoise, a recurrent neural network noise suppressor, for processing audio documents. When enabled, noise reduction is applied before transcription.
| Setting | Variable | Values |
|---|---|---|
| Enable/disable | RNNOISE_ENABLED | true · false |
| Suppression level | RNNOISE_LEVEL | 0.0 (off) to 1.0 (maximum) |

At call time, assistant-api queries document-api with a text query. The service performs vector similarity search and returns the top-k most relevant chunks.

Search request
curl -X POST http://localhost:9010/api/v1/document/search \
  -H "Authorization: Bearer <jwt>" \
  -H "Content-Type: application/json" \
  -d '{
    "query": "customer billing issue",
    "knowledge_base_id": "kb_123",
    "top_k": 5,
    "threshold": 0.5
  }'
Response
{
  "results": [
    {
      "chunk_id": "chunk_123",
      "document_id": "doc_456",
      "content": "Billing errors are handled by submitting a refund request...",
      "similarity_score": 0.87,
      "metadata": {
        "page_no": 5,
        "section": "Billing Policy"
      }
    }
  ]
}
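assistant-api then folds the returned chunks into the LLM prompt. A hypothetical sketch of that assembly step, using the field names from the response above (the prompt layout and size cap are assumptions, not the documented implementation):

```python
def build_context(results, max_chars: int = 4000) -> str:
    """Join search-result chunks (highest similarity first) into one
    context block, stopping before max_chars would be exceeded."""
    ordered = sorted(results, key=lambda r: r["similarity_score"], reverse=True)
    parts, used = [], 0
    for r in ordered:
        section = r.get("metadata", {}).get("section", "")
        entry = f"[{section}] {r['content']}" if section else r["content"]
        if used + len(entry) > max_chars:
            break
        parts.append(entry)
        used += len(entry)
    return "\n\n".join(parts)
```

The character cap keeps the injected context from crowding out the rest of the prompt.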

Configuration

The document-api uses a YAML config file at docker/document-api/config.yaml combined with environment variables.

Required settings

| Variable | Required | Default | Description |
|---|---|---|---|
| postgres.host | ✅ Yes | localhost | PostgreSQL host |
| postgres.db | ✅ Yes | assistant_db | Database name |
| postgres.auth.user | ✅ Yes | rapida_user | Database user |
| postgres.auth.password | ✅ Yes | — | Database password |
| elastic_search.host | ✅ Yes | localhost | OpenSearch host |
| celery.broker | ✅ Yes | redis://localhost:6379/0 | Celery broker URL |
| celery.backend | ✅ Yes | redis://localhost:6379/0 | Celery result backend URL |

Tuning settings

| Setting | Default | Description |
|---|---|---|
| CHUNK_SIZE | 1000 | Characters per document chunk |
| CHUNK_OVERLAP | 100 | Character overlap between adjacent chunks |
| MAX_FILE_SIZE | 52428800 | Maximum upload size in bytes (50 MB) |
| EMBEDDINGS_MODEL | all-MiniLM-L6-v2 | Sentence-transformers model name |
| EMBEDDINGS_DIMENSION | 384 | Embedding vector dimension |
| CELERY_WORKERS | 4 | Number of Celery worker processes |
| RNNOISE_ENABLED | true | Enable audio noise reduction |
| RNNOISE_LEVEL | 0.5 | Noise reduction level (0.0–1.0) |
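CHUNK_SIZE and CHUNK_OVERLAP together determine how many chunks, and therefore how many embeddings, a document produces. A rough estimate, assuming a plain sliding-window splitter:

```python
import math

def estimated_chunks(doc_chars: int, chunk_size: int = 1000, overlap: int = 100) -> int:
    """Approximate chunk count for a document of doc_chars characters."""
    step = chunk_size - overlap          # window advance per chunk
    return max(1, math.ceil(max(doc_chars - overlap, 1) / step))
```

With the defaults, a 100,000-character document yields roughly 111 chunks, so raising CHUNK_SIZE cuts embedding work at the cost of coarser retrieval granularity.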

Full config file (docker/document-api/config.yaml)

service_name: "Document API"
host: "0.0.0.0"
port: 9010

authentication_config:
  strict: false
  type: "jwt"
  config:
    secret_key: "rpd_pks"   # Must match SECRET in other services

elastic_search:
  host: "opensearch"        # Use "localhost" for local dev
  port: 9200
  scheme: "http"
  max_connection: 5

postgres:
  host: "postgres"          # Use "localhost" for local dev
  port: 5432
  auth:
    password: "rapida_db_password"
    user: "rapida_user"
  db: "assistant_db"
  max_connection: 10
  ideal_connection: 5

internal_service:
  web_host: "web-api:9001"
  integration_host: "integration-api:9004"
  endpoint_host: "endpoint-api:9005"
  assistant_host: "assistant-api:9007"

storage:
  storage_type: "local"
  storage_path_prefix: /app/rapida-data/assets/workflow

celery:
  broker: "redis://redis:6379/0"
  backend: "redis://redis:6379/0"

knowledge_extractor_config:
  chunking_technique:
    chunker: "app.core.chunkers.statistical_chunker.StatisticalChunker"
    options:
      encoder: "app.core.encoders.openai_encoder.OpenaiEncoder"
      options:
        model_name: "text-embedding-3-large"
        api_key: "your_openai_api_key"

Running

document-api is part of the knowledge Docker Compose profile and is not started by default.
# Start document-api together with all other services and opensearch
make up-all-with-knowledge

# Or start document-api individually (opensearch must already be running)
make up-document

# View logs
make logs-document

# Rebuild
make rebuild-document

Health & Observability

| Endpoint | Purpose |
|---|---|
| GET /readiness/ | Reports whether the service is ready |
| GET /healthz/ | Liveness probe |
curl http://localhost:9010/readiness/

Troubleshooting

If documents stay stuck in status: processing, the Celery worker is likely not running. Check:
# Docker
make logs-document

# Local — confirm Celery worker is running
PYTHONPATH=api/document-api celery -A app.worker inspect active
If embedding generation runs out of memory, reduce the batch size; increase it for throughput on capable hardware:
EMBEDDINGS_BATCH_SIZE=8     # Low memory
EMBEDDINGS_BATCH_SIZE=64    # High throughput (GPU recommended)
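The batch size matters because the model encodes one batch at a time, so peak memory scales with the batch while throughput improves as batches grow. A sketch of the batching pattern such a setting presumably drives:

```python
def batched(items, batch_size):
    """Yield successive slices of at most batch_size items."""
    for i in range(0, len(items), batch_size):
        yield items[i:i + batch_size]
```

Each yielded batch of chunk texts would then be encoded in a single model call.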
If search returns errors or stale results, inspect the OpenSearch indices:
# List existing indices
curl http://localhost:9200/_cat/indices

# Delete a stale index and allow re-indexing
curl -X DELETE http://localhost:9200/documents-<index-name>
If the container consumes too much CPU or memory:
# Reduce Celery worker concurrency
CELERY_CONCURRENCY=2

# Monitor per-container usage
docker stats document-api

Next Steps