Overview
The document-api is the knowledge backend for the Rapida platform. It processes documents from PDF, Word, CSV, and other formats into searchable vector embeddings and full-text search indices. At call time, assistant-api queries this service to inject relevant knowledge context into the LLM prompt.
| | |
|---|---|
| Port | 9010 — HTTP (FastAPI / uvicorn) |
| Language | Python 3.11+ (FastAPI + Celery) |
| Storage | PostgreSQL (assistant_db), Redis (Celery broker), OpenSearch (vectors + text) |

Document processing is asynchronous. When a document is uploaded, the API immediately returns a document_id with status: processing. Text extraction, chunking, and embedding generation are handled by Celery workers in the background.

Components
Document Ingestion Pipeline
The processing pipeline runs as a Celery task after each upload. The stages are sequential and the document status is updated at each step.
| Stage | Library | Configurable |
|---|---|---|
| Text extraction | format-specific (see table below) | No |
| Chunking | custom splitter | CHUNK_SIZE, CHUNK_OVERLAP |
| Embeddings | sentence-transformers | EMBEDDINGS_MODEL |
| Full-text index | OpenSearch | — |
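The stages above can be sketched end to end. This is an illustrative sketch only: the stage bodies and status names are hypothetical stand-ins, not the service's actual code.

```python
# Illustrative sketch of the sequential ingestion pipeline; stage bodies and
# status names are hypothetical, not the document-api implementation.

def extract_text(raw: bytes) -> str:
    # Real code dispatches on file format (see the supported-formats table).
    return raw.decode("utf-8")

def chunk_text(text: str, size: int = 1000, overlap: int = 100) -> list[str]:
    # Fixed-size windows with overlap, mirroring CHUNK_SIZE / CHUNK_OVERLAP.
    step = size - overlap
    return [text[i:i + size] for i in range(0, max(len(text) - overlap, 1), step)]

def embed(chunks: list[str]) -> list[list[float]]:
    # Placeholder for sentence-transformers; real vectors are
    # EMBEDDINGS_DIMENSION wide (384 for the default model).
    return [[0.0] * 384 for _ in chunks]

def ingest(doc_id: str, raw: bytes, status_log: list[str]) -> list[str]:
    # Stages run in order; the document status is updated after each one.
    status_log.append("extracting")
    text = extract_text(raw)
    status_log.append("chunking")
    chunks = chunk_text(text)
    status_log.append("embedding")
    vectors = embed(chunks)
    status_log.append("indexing")   # OpenSearch bulk write would happen here
    status_log.append("ready")
    return chunks
```

Note the overlap: each chunk repeats the last CHUNK_OVERLAP characters of its predecessor, so a sentence straddling a chunk boundary still appears whole in at least one chunk.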
Supported File Formats
| Format | Library | What is extracted |
|---|---|---|
| PDF | PyPDF2, pdfplumber | Text content + metadata |
| Word (.docx) | python-docx | Text + paragraph structure |
| Excel (.xlsx) | openpyxl, pandas | Cell values as text |
| CSV | pandas | Row data as text |
| Markdown (.md) | built-in | Text preserving structure |
| HTML | BeautifulSoup | Cleaned text from HTML |
| Plain text (.txt) | built-in | Direct read |
| Images | pytesseract (OCR) | OCR-extracted text |
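Selecting an extractor amounts to a lookup on the file extension. The mapping below mirrors the table above; the helper itself is illustrative, not the service's API.

```python
from pathlib import Path

# Extension -> extraction library, mirroring the table above.
# pick_extractor is an illustrative helper, not the service's actual API.
EXTRACTORS = {
    ".pdf": "PyPDF2/pdfplumber",
    ".docx": "python-docx",
    ".xlsx": "openpyxl/pandas",
    ".csv": "pandas",
    ".md": "built-in",
    ".html": "BeautifulSoup",
    ".txt": "built-in",
    ".png": "pytesseract (OCR)",
    ".jpg": "pytesseract (OCR)",
}

def pick_extractor(filename: str) -> str:
    suffix = Path(filename).suffix.lower()
    if suffix not in EXTRACTORS:
        raise ValueError(f"unsupported format: {suffix or filename}")
    return EXTRACTORS[suffix]
```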
Embedding Models
Embeddings are generated using sentence-transformers. The model is configurable via EMBEDDINGS_MODEL in the config:

| Model | Dimensions | Speed | Quality | Notes |
|---|---|---|---|---|
| all-MiniLM-L6-v2 | 384 | Fast | Good | Default, ~80 MB |
| all-mpnet-base-v2 | 768 | Medium | High | Larger model |
| all-MiniLM-L12-v2 | 384 | Medium | Good | 12-layer variant, slower than L6 |
| multilingual-e5-base | 768 | Medium | Good | 100+ languages |
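EMBEDDINGS_DIMENSION has to agree with the chosen model, since vectors of the wrong width cannot be written into an existing index. A startup guard along these lines makes the coupling explicit; the check is illustrative, not part of the service's actual code.

```python
# Model -> dimension pairs from the table above; check_embedding_config
# is an illustrative startup guard, not the service's actual code.
MODEL_DIMENSIONS = {
    "all-MiniLM-L6-v2": 384,
    "all-mpnet-base-v2": 768,
    "all-MiniLM-L12-v2": 384,
    "multilingual-e5-base": 768,
}

def check_embedding_config(model: str, dimension: int) -> None:
    """Fail fast when EMBEDDINGS_DIMENSION disagrees with EMBEDDINGS_MODEL."""
    expected = MODEL_DIMENSIONS.get(model)
    if expected is not None and expected != dimension:
        raise ValueError(
            f"EMBEDDINGS_DIMENSION={dimension} does not match {model} "
            f"(which produces {expected}-dimensional vectors)"
        )
```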
Audio Noise Reduction (RNNoise)
The document-api includes RNNoise, a recurrent neural network noise suppressor, for processing audio documents. When enabled, noise reduction is applied before transcription.
| Setting | Variable | Values |
|---|---|---|
| Enable/disable | RNNOISE_ENABLED | true · false |
| Suppression level | RNNOISE_LEVEL | 0.0 (off) to 1.0 (maximum) |
Semantic Search
At call time, assistant-api queries document-api with a text query. The service performs a vector similarity search and returns the top-k most relevant chunks.
Search request
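Conceptually, the service embeds the query text and ranks stored chunks by vector similarity. The real implementation delegates this to OpenSearch, but the ranking itself reduces to a cosine top-k, sketched here with illustrative names:

```python
import math

def cosine(a: list[float], b: list[float]) -> float:
    # Assumes non-zero vectors, which embeddings are in practice.
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

def top_k(query_vec: list[float],
          chunks: list[tuple[str, list[float]]],
          k: int = 5) -> list[str]:
    # chunks: (chunk_text, embedding) pairs; returns the k best-matching texts.
    ranked = sorted(chunks, key=lambda c: cosine(query_vec, c[1]), reverse=True)
    return [text for text, _ in ranked[:k]]
```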
Configuration
The document-api uses a YAML config file at docker/document-api/config.yaml combined with environment variables.
Required settings
| Variable | Required | Default | Description |
|---|---|---|---|
| postgres.host | ✅ Yes | localhost | PostgreSQL host |
| postgres.db | ✅ Yes | assistant_db | Database name |
| postgres.auth.user | ✅ Yes | rapida_user | Database user |
| postgres.auth.password | ✅ Yes | — | Database password |
| elastic_search.host | ✅ Yes | localhost | OpenSearch host |
| celery.broker | ✅ Yes | redis://localhost:6379/0 | Celery broker URL |
| celery.backend | ✅ Yes | redis://localhost:6379/0 | Celery result backend URL |
Tuning settings
| Setting | Default | Description |
|---|---|---|
| CHUNK_SIZE | 1000 | Characters per document chunk |
| CHUNK_OVERLAP | 100 | Character overlap between adjacent chunks |
| MAX_FILE_SIZE | 52428800 | Maximum upload size in bytes (50 MB) |
| EMBEDDINGS_MODEL | all-MiniLM-L6-v2 | Sentence-transformers model name |
| EMBEDDINGS_DIMENSION | 384 | Embedding vector dimension |
| CELERY_WORKERS | 4 | Number of Celery worker processes |
| RNNOISE_ENABLED | true | Enable audio noise reduction |
| RNNOISE_LEVEL | 0.5 | Noise reduction level (0.0–1.0) |
Full config file (docker/document-api/config.yaml)
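The dotted keys in the required-settings table suggest a nesting along these lines. This is a reconstruction from the tables above, not the verbatim shipped file:

```yaml
# Reconstructed shape based on the dotted keys above; the actual
# docker/document-api/config.yaml may differ.
postgres:
  host: localhost
  db: assistant_db
  auth:
    user: rapida_user
    password: ""          # required; no default

elastic_search:
  host: localhost

celery:
  broker: redis://localhost:6379/0
  backend: redis://localhost:6379/0
```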
Running
- Docker Compose
- From Source
document-api is part of the knowledge Docker Compose profile and is not started by default.

Health & Observability
| Endpoint | Purpose |
|---|---|
| GET /readiness/ | Reports whether the service is ready |
| GET /healthz/ | Liveness probe |
Troubleshooting
Document stuck in 'processing' status

The Celery worker is likely not running. Check that a worker process is up and that it can reach the Redis broker configured in celery.broker.
Embedding generation is slow

Reduce the embedding batch size to lower memory pressure, or increase it for throughput on capable hardware. Switching to a faster model such as the default all-MiniLM-L6-v2 also helps.
OpenSearch index errors

A common cause is a mismatch between EMBEDDINGS_DIMENSION and the dimension of the existing index: after switching embedding models, recreate the index with the new vector dimension.
High memory usage

Each Celery worker loads its own copy of the embedding model, so memory scales with CELERY_WORKERS; reduce the worker count or switch to a smaller model such as the ~80 MB all-MiniLM-L6-v2.