Port: 9010
Technology: Python + FastAPI
Language: Python 3.11+
Primary Database: PostgreSQL (document_db)
Search Engine: OpenSearch
Background Jobs: Celery + Redis
Purpose
The Document API handles:
- Document upload and processing
- Text extraction from multiple formats
- Text chunking and segmentation
- Semantic embeddings generation
- Vector similarity search
- Knowledge base organization
- Audio noise reduction (RNNoise)
- Entity extraction and tagging
- Full-text search indexing
- RAG query processing
Key Features
Document Processing
- Support for 8+ file formats (PDF, DOCX, XLSX, CSV, MD, HTML, TXT, images)
- Automatic text extraction
- Metadata preservation
- OCR for image documents (planned)
- Language detection
- Character encoding handling
Text Chunking
- Configurable chunk size (default: 1000 chars)
- Overlap between chunks (default: 100 chars)
- Semantic-aware chunking
- Chunk metadata preservation
- Token counting
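The chunk size and overlap defaults above can be illustrated with a minimal fixed-window sketch. The function name `chunk_text` and the metadata keys are assumptions made for illustration, not the service's actual API, and this version splits on character offsets only (it does not attempt the semantic-aware splitting or token counting listed above).

```python
def chunk_text(text: str, chunk_size: int = 1000, overlap: int = 100) -> list[dict]:
    """Split text into overlapping character windows (hypothetical helper)."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be >= 0 and smaller than chunk_size")

    step = chunk_size - overlap  # each window starts this far after the previous one
    chunks = []
    for index, start in enumerate(range(0, len(text), step)):
        piece = text[start:start + chunk_size]
        # Keep the offsets so each chunk can be traced back to the source text.
        chunks.append({"index": index, "start": start, "end": start + len(piece), "text": piece})
        if start + chunk_size >= len(text):
            break  # this window already reached the end of the text
    return chunks


if __name__ == "__main__":
    for chunk in chunk_text("a" * 2500):
        print(chunk["index"], chunk["start"], chunk["end"])
```

With the defaults, consecutive chunks share 100 characters, so a sentence that straddles a boundary still appears whole in at least one chunk when it is shorter than the overlap.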
Embeddings
- Semantic embeddings using sentence transformers
- Vector similarity search
- Cosine distance calculations
- Batch embedding processing
- Embedding caching
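As a rough illustration of cosine-based similarity search, the sketch below ranks stored vectors against a query vector using only the standard library. The names `cosine_similarity`, `cosine_distance`, and `top_k` are hypothetical, and a brute-force scan like this only stands in for whatever vector index the service actually uses.

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # treat a zero vector as unrelated to everything
    return dot / (norm_a * norm_b)


def cosine_distance(a: list[float], b: list[float]) -> float:
    """Distance form: 0 for identical direction, 1 for orthogonal vectors."""
    return 1.0 - cosine_similarity(a, b)


def top_k(query: list[float], vectors: dict[str, list[float]], k: int = 3) -> list[tuple[str, float]]:
    """Return the k stored vectors most similar to the query (brute force)."""
    scored = [(key, cosine_similarity(query, vec)) for key, vec in vectors.items()]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]


if __name__ == "__main__":
    store = {"chunk-a": [1.0, 0.0], "chunk-b": [0.0, 1.0], "chunk-c": [1.0, 1.0]}
    print(top_k([1.0, 0.1], store, k=2))
```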
Audio Processing
- RNNoise integration for noise reduction
- Audio format conversion
- Sample rate normalization
- Voice isolation
- Quality metrics
Knowledge Organization
- Knowledge base grouping
- Document categorization
- Tag-based organization
- Collection management
- Version control
Configuration
Environment Variables
Source Code Structure
Supported File Formats
| Format | Handler | Extraction |
|---|---|---|
| PDF | PyPDF2, pdfplumber | Text + metadata |
| DOCX | python-docx | Text + formatting |
| XLSX | openpyxl, pandas | Cells as text |
| CSV | pandas | Rows as text |
| MD | markdown | Text structure preserved |
| HTML | BeautifulSoup | HTML to text |
| TXT | built-in | Direct read |
| Images | pytesseract (OCR, planned) | OCR to text |
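One way to read the table is as a lookup from file extension to handler. The sketch below shows that idea only: the function `pick_handler`, the extension spellings, and the set of image extensions are assumptions, and the values are the library names from the table rather than real handler objects.

```python
from pathlib import Path

# Extension -> handler label, taken from the table above.
# The exact extensions accepted by the service are an assumption.
HANDLERS = {
    ".pdf": "PyPDF2, pdfplumber",
    ".docx": "python-docx",
    ".xlsx": "openpyxl, pandas",
    ".csv": "pandas",
    ".md": "markdown",
    ".html": "BeautifulSoup",
    ".htm": "BeautifulSoup",
    ".txt": "built-in",
}

# Hypothetical set of image types routed to OCR.
IMAGE_EXTENSIONS = {".png", ".jpg", ".jpeg", ".tiff"}


def pick_handler(filename: str) -> str:
    """Return the handler label for a file, based on its extension."""
    suffix = Path(filename).suffix.lower()
    if suffix in IMAGE_EXTENSIONS:
        return "pytesseract (OCR)"
    if suffix not in HANDLERS:
        raise ValueError(f"unsupported file format: {suffix or filename!r}")
    return HANDLERS[suffix]


if __name__ == "__main__":
    for name in ("report.PDF", "notes.md", "scan.png"):
        print(name, "->", pick_handler(name))
```

Rejecting unknown extensions up front keeps unsupported uploads from reaching the processing queue.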
Building and Running
Development Setup
Production with Docker
Document Processing Flow
Upload Document
Processing Status
Semantic Search
Vector Search Example
Full-Text Search
Text Search Example
Audio Processing with RNNoise
RNNoise Noise Reduction
RNNoise is included for audio noise reduction in document processing.
Audio Processing Output
Embeddings Model
Current Model: all-MiniLM-L6-v2
- Dimensions: 384
- Speed: Fast (real-time)
- Quality: Good for semantic search
- Size: ~80MB
- License: Apache 2.0
Alternative Models
Celery Background Jobs
Job Types
- process_document
  - Extract text
  - Create chunks
  - Generate embeddings
  - Index in OpenSearch
- generate_embeddings
  - Generate embeddings for chunks
  - Batch processing
- process_audio
  - Audio noise reduction
  - Transcription
  - Speaker diarization (planned)
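The four steps of `process_document` run in a fixed order, which can be sketched as a plain function that takes each stage as a callable. This is an assumption-heavy illustration: in the service the job would presumably be registered as a Celery task, and the parameter names and return value here are hypothetical. Plain callables are used so the sketch runs without a broker.

```python
from typing import Callable


def process_document(
    source: str,
    extract: Callable[[str], str],
    chunk: Callable[[str], list[str]],
    embed: Callable[[list[str]], list[list[float]]],
    index: Callable[[list[str], list[list[float]]], None],
) -> dict:
    """Run the four stages in order: extract, chunk, embed, index."""
    text = extract(source)       # 1. Extract text
    chunks = chunk(text)         # 2. Create chunks
    vectors = embed(chunks)      # 3. Generate embeddings
    index(chunks, vectors)       # 4. Index (OpenSearch in the real service)
    return {"chunk_count": len(chunks)}


if __name__ == "__main__":
    indexed = []
    result = process_document(
        "hello world",
        extract=lambda s: s.strip(),
        chunk=lambda t: t.split(),
        embed=lambda cs: [[float(len(c))] for c in cs],  # stand-in for a real model
        index=lambda cs, vs: indexed.extend(zip(cs, vs)),
    )
    print(result, indexed)
```

Passing the stages in as arguments keeps each one replaceable in tests, for example swapping the embedding model for a stub.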
Job Status
Monitoring
Health Check
Metrics
- Documents processed per day
- Average processing time
- Embedding generation latency
- Search query latency
- Storage usage
- Queue depth (Celery jobs)
Logging
Structured logs with:
- Document ID
- Processing stage
- Duration
- Chunk count
- Error details (if any)
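A minimal sketch of how such fields could be emitted as JSON with the standard `logging` module follows. The class name `JsonFormatter`, the logger name, and the field names (`document_id`, `stage`, `duration_ms`, `chunk_count`, `error`) are assumptions chosen to mirror the list above, not the service's actual log schema.

```python
import json
import logging


class JsonFormatter(logging.Formatter):
    """Render a log record as one JSON object per line (hypothetical schema)."""

    FIELDS = ("document_id", "stage", "duration_ms", "chunk_count", "error")

    def format(self, record: logging.LogRecord) -> str:
        payload = {"level": record.levelname, "message": record.getMessage()}
        # Values passed through `extra=` become attributes on the record.
        for field in self.FIELDS:
            if hasattr(record, field):
                payload[field] = getattr(record, field)
        return json.dumps(payload)


if __name__ == "__main__":
    handler = logging.StreamHandler()
    handler.setFormatter(JsonFormatter())
    logger = logging.getLogger("document_api")  # logger name is an assumption
    logger.addHandler(handler)
    logger.setLevel(logging.INFO)

    logger.info(
        "stage finished",
        extra={"document_id": "doc-123", "stage": "chunking", "duration_ms": 42, "chunk_count": 7},
    )
```

Fields that are not supplied (such as `error` on a successful run) are simply omitted from the output.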