A fully offline RAG-based document question answering system optimized for Windows PCs. Features semantic search, hybrid retrieval, and CPU-based LLM inference with GGUF models.
Two first-class delivery options share the same offline RAG capabilities:
- Desktop app — Python + PyInstaller + llama.cpp/GGUF (CustomTkinter GUI / FastAPI).
- HTML5 web app (`web_ui/`): a fully self-contained, STIG-scannable archive that runs entirely in the browser with no runtime downloads.
The browser app is a complete, offline RAG client. See PACKAGING.md for the build/bundle steps.
- Fully offline, packaged models: embeddings (bge-small ONNX), ONNX Runtime WASM, and the browser LLM are served same-origin from `public/models/`; nothing is fetched from a CDN or the HuggingFace Hub at runtime. A readiness gate reports "models ready vs missing".
- Two user-selectable browser engines: wllama (llama.cpp WASM, CPU/SIMD, no WebGPU; the default and most robust on i5/Iris Xe) and WebLLM (WebGPU, faster when available). A hardware-capability panel detects WebGPU/threads/memory and recommends an engine.
- Multimodal — attach a screenshot in chat and ask about it (wllama + LFM2-VL mmproj), offline.
- Chat UX — streaming with interactive source citations, regenerate, conversation export (Markdown/JSON), and Fast/Balanced/Quality RAG presets.
- Self-contained archive: `npm run build:offline` produces a validated `web_ui/dist/` that the desktop FastAPI server (or any root static host) serves with the COOP/COEP headers wllama needs.
- Application Shell: Navigation rail with Chat, Documents, Settings pages and responsive flexbox layout
- Theme System: Dark/light mode toggle with system preference detection and localStorage persistence
- Toast Notifications: Non-blocking toast system with success/error/info variants and entrance animations
- Keyboard Shortcuts: Ctrl+Enter (send), Ctrl+L (clear chat), Ctrl+, (open settings) with input/textarea focus guard
- Testing Framework: vitest configured with @testing-library/react and jsdom environment
- Offline-First Design: No internet required after initial setup
- Multi-format Support: PDF, DOCX, PPTX, TXT, MD documents
- Hybrid Retrieval: BM25 + Vector search with Reciprocal Rank Fusion (RRF)
- Window Expansion: Automatically fetches adjacent context chunks
- Smart Chunking: Paragraph and sentence boundary aware
- Cross-Encoder Reranking: Optional MS MARCO MiniLM for precise ranking
The application uses GGUF models via llama-cpp-python for fully offline inference:
- Default Model: Gemma 4 E2B (Q5_K_M GGUF, ~3.1GB) — bundled
- Set via: `RAG_GGUF_PATH` environment variable or `--gguf-path` CLI option
- No GPU required
- No network access required
- ~5-10 tokens/second on standard CPU
- Windows 11 (64-bit)
- Intel Core i5 11th generation or newer (or equivalent AMD Ryzen 5000+)
- Intel integrated graphics (present on all 11th gen+ Intel CPUs) — no discrete GPU required
- 16GB RAM
- ~4GB free storage for model + app
- Performance: ~5-7 tokens/second
- Intel Core i7 12th generation or newer (or equivalent AMD Ryzen 7000+)
- Intel Iris Xe integrated graphics or discrete GPU
- 32GB RAM
- SSD for vector database
- Performance: ~10-15 tokens/second
- High-end CPU (Intel Core i9 or AMD Ryzen 9)
- 64GB RAM
- Performance: ~15-20 tokens/second (CPU-only with GGUF)
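To put the throughput figures above in perspective, the following sketch converts a decode rate into the wall-clock time needed for a full-length answer. The tier names are labels chosen here for illustration, and the estimate ignores prompt processing and retrieval time, so treat the numbers as rough lower bounds.

```python
def generation_seconds(num_tokens: int, tokens_per_second: float) -> float:
    """Time to emit num_tokens at a steady decode rate (decode phase only)."""
    if tokens_per_second <= 0:
        raise ValueError("tokens_per_second must be positive")
    return num_tokens / tokens_per_second


# Decode-rate ranges quoted in the hardware lists, in tokens/second.
# The tier names are illustrative labels, not identifiers used by the app.
TIERS = {
    "minimum": (5, 7),
    "recommended": (10, 15),
    "high-end": (15, 20),
}

MAX_TOKENS = 1024  # default value of RAG_MAX_TOKENS

for name, (slow, fast) in TIERS.items():
    worst = generation_seconds(MAX_TOKENS, slow)
    best = generation_seconds(MAX_TOKENS, fast)
    print(f"{name}: {best:.0f}-{worst:.0f} s for {MAX_TOKENS} tokens")
```

Most answers stop well before the token limit, so typical response times will be shorter than these worst-case figures.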
- Streaming Chat Interface: Full-featured chat page (`ChatPage.tsx`) with real-time token streaming display using RAF-batched updates via `TokenStreamManager`
- Role-Based Message Bubbles: Distinct styling for user, assistant, and system messages with relative timestamps ("2m ago", "just now")
- Inline Markdown Renderer: Zero-dependency markdown parser supporting bold, italic, inline code, fenced code blocks, ordered/unordered lists, and links with URL validation (rejects javascript: and data: URLs)
- Source Citation Pills: Expandable/collapsible source pills with filename truncation, full path reveal on click, and one-click copy-to-clipboard
- Inference Mode Toggle: Status indicator (green/yellow/red) for browser-local vs API mode with server connectivity check against `/auth/status` endpoint
- Streaming Cursor Animation: Blinking cursor (`@keyframes blink`) appended to assistant messages during streaming for visual feedback
- Copy Message: Hover-to-reveal copy button on user and assistant bubbles with 1.5s "Copied!" feedback
- Streaming Indicator: Bouncing dots animation (setInterval-based, 3 dots cycling at 200ms) shown below messages during generation
- Operation Cancellation: Cancel button stops `TokenStreamManager`, clears pending mock timers, and marks streaming messages complete
- Dual-Mode Context: `InferenceModeContext` (`InferenceModeContext.tsx`) manages `browser-local` vs `api` mode via React context
- localStorage Persistence: Mode preference and server URL stored under `inference-mode` key; survives page refresh
- Server Connectivity Check: `checkServerConnectivity()` pings `/auth/status` with 5s timeout, handles abort for rapid toggles, updates `isServerConnected` and `modeError` state
- Model Loading Progress: `modelLoadingProgress` (0–100) displayed in blocking overlay when browser-local model is initializing
- API Mode Warning: "Server not connected" warning shown in header when API mode is active but server is unreachable
- Expandable Citations: Click to expand source pills showing full filename, page number, and content preview
- One-Click Copy: Copy button on each pill copies citation text to clipboard
- Hover Preview: Hover shows truncated source preview with tooltip for full content
- Phase Attribution: Pills labeled with "Phase 3" or "Phase 4" indicating extraction source
- CTkTooltip Class: Non-blocking hover tooltips with 500ms delay for all settings fields
- Contextual Help: Each RAG configuration field has descriptive hint text explaining its purpose
- Dark Theme Tooltips: Tooltips use dark background (#3a3a4e) with white text for consistent visibility
- Browser-Side Extraction: All document processing happens locally in the browser with no server uploads
- Multi-Format Support: PDF, DOCX, XLSX, PPTX, TXT, and MD files via dedicated extractors
- Extractor Factory: `ExtractorFactory` selects the appropriate extractor based on MIME type
- Semantic Chunking: Faithful Python port with paragraph/sentence boundary awareness, configurable overlap, page mapping, and SHA256 content IDs
- IndexedDB Storage: Documents, chunks, and metadata persisted locally via `document-store.ts`
- Documents Page: Full-featured `/documents` page with drag-and-drop upload, file processing pipeline, and document list with status tracking
- DropZone Component: Drag-and-drop or click-to-browse file input with visual feedback and progress indication
- DocumentList Component: Paginated document list showing name, type, size, status, and date with delete functionality
| Format | Extractor | Library |
|---|---|---|
| PDF | `pdf-extractor.ts` | pdfjs-dist |
| DOCX | `docx-extractor.ts` | mammoth |
| XLSX | `xlsx-extractor.ts` | xlsx |
| PPTX | `pptx-extractor.ts` | jszip + xml parsing |
| TXT/MD | `txt-extractor.ts` | Native text processing |

Dependencies: pdfjs-dist ^4.4.168, mammoth ^1.8.0, xlsx ^0.18.5, jszip ^3.10.1
- Real-time UI Updates: Font size slider now applies to all widgets immediately when saved
- Debug Mode: Toggle debug-level logging for troubleshooting
- Log File Persistence: Customizable log file path with automatic persistence
- Auto-Reconfiguration: RAG settings (chunk size, n_results, etc.) trigger engine reinitialization when changed
- Thread-Safe RAG Engine: Full serialization via `asyncio.to_thread()` wrapping for blocking endpoints
- ChromaDB Locking: `RLock` for vector store operations preventing concurrent access corruption
- BM25 Index Threadsafety: Incremental add operations protected by RLock for safe concurrent document ingestion
- Lazy LLM Initialization: On-demand LLM loading reduces memory footprint for CLI/API modes
- Cancellation Propagation: `cancellation_event` passed through query processing for responsive long-operation termination
- Memory Budget Checks: Pre-ingestion memory validation prevents OOM errors on large document sets
- QueryTransformer Singleton: Shared transformer instance across requests with thread-safe initialization
- Cross-Encoder threadsafety: `__new__` pattern ensures single instance with RLock for concurrent reranking
- Neighborhood Expansion: Increased k from 3 to 5 chunks for better context coverage in streaming mode
- Embedding Batch Normalization: Consistent batch sizes for predictable memory usage during ingestion
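The `__new__`-plus-`RLock` singleton mentioned for the cross-encoder can be sketched as follows. This is a minimal illustration under stated assumptions, not the application's code: the class name `RerankerSingleton` is hypothetical, and the "model" is a stand-in word-overlap scorer so the example runs without any ML dependency.

```python
import threading


class RerankerSingleton:
    """Hypothetical stand-in for a process-wide cross-encoder wrapper."""

    _instance = None
    _lock = threading.RLock()

    def __new__(cls):
        # Double-checked locking: the lock is taken only while the
        # instance has not been created yet.
        if cls._instance is None:
            with cls._lock:
                if cls._instance is None:
                    instance = super().__new__(cls)
                    instance._model = None  # loaded lazily on first use
                    cls._instance = instance
        return cls._instance

    def rerank(self, query, chunks):
        # Serialize model access so concurrent requests cannot interleave.
        with self._lock:
            if self._model is None:
                self._model = self._load_model()
            scored = [(self._model(query, chunk), chunk) for chunk in chunks]
        scored.sort(key=lambda pair: pair[0], reverse=True)
        return [chunk for _, chunk in scored]

    @staticmethod
    def _load_model():
        # Placeholder scorer (assumption): number of shared lowercase words.
        def score(query, chunk):
            return len(set(query.lower().split()) & set(chunk.lower().split()))

        return score
```

Every call to `RerankerSingleton()` returns the same object, so the expensive model load happens at most once per process. An `RLock` rather than a plain `Lock` lets a method that already holds the lock call another locked method on the same thread without deadlocking.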
- Transformers.js Embeddings: Browser-side embedding generation using `bge-small-en-v1.5` ONNX model with OPFS caching for offline use
- HNSW Vector Index: `EdgeVec` Rust/WASM-based HNSW index with native IndexedDB persistence for semantic search
- FlexSearch Keyword Index: Full-text keyword search with resolution-based scoring for BM25-style matching
- Reciprocal Rank Fusion: Ported RRF algorithm for hybrid retrieval combining semantic and keyword results
- Cross-Encoder Reranking: `ms-marco-MiniLM-L-6-v2` reranker with memory-aware conditional activation (skipped on low-memory devices)
- Memory-Aware Model Selection: Device memory detection with tier-based configuration (low/medium/high memory tiers)
- WebLLM Service: Browser-side LLM inference using `@mlc-ai/web-llm` with `CreateMLCEngine` API for SmolLM3-3B-Q4_K_M (~1.9GB), OPFS caching, and streaming token generation
- Model Download Manager: Progress tracking with speed/ETA calculation, cancellation support, and storage quota error handling
- ModelDownloadProgress UI: Accessible progress bar with ARIA attributes, download speed, ETA countdown, and cancel button
- Model Readiness Gate: Pre-flight checks for WebGPU availability, memory sufficiency (2GB minimum), and OPFS cache status; guides users to server API mode when requirements aren't met
- RAG Orchestrator: Full retrieval pipeline connecting embedding→vector search→keyword search→RRF fusion→reranking→LLM generation; emits typed `RAGEvent` stream for UI progress
- WebGPU Watchdog: Context loss detection via `GPUDevice.lost` promise/event monitoring; `createRecoveryHandler` automatically re-initializes the service after loss
- Thinking Indicator: Animated "Thinking..." with dots while LLM generates responses
- Smart Regeneration: "Regenerate" button replaces the last assistant message instead of creating duplicates
- Feedback System: Working thumbs up/down buttons that persist to database
- Conversation Context Menu: Right-click options to delete or rename conversations
- Time Display: Relative timestamps in sidebar (e.g., "2 min ago", "Yesterday")
- Dedicated Settings Page (`SettingsPage.tsx`): Full-featured settings UI with 6 sections:
  - Inference Mode: Toggle between browser-local (WebGPU) and API server modes with real-time state sync
  - Server Configuration: Server URL input with connection test button and status indicators
  - Model Selection: Dropdown for AI model choice with cache status, download progress, and cancel support
  - Appearance: Theme selector (light/dark/system) with immediate UI application
  - Storage: Memory budget display, memory pressure status, and two-click cache clear with confirmation
  - About: Version info and app description
- IndexedDB Persistence: User preferences (theme, preferredModel, serverUrl) stored in IndexedDB with automatic load/save
- InferenceModeProvider at Root: Provider moved to `App.tsx` root level for shared state across all pages (Chat, Documents, Settings)
- Cross-Browser Compatibility (`browser-compat.ts`): Detection for Chrome/Edge 113+ (full WebGPU), Firefox (degraded/experimental), Safari (degraded/partial); provides compatibility guidance with upgrade recommendations
- Reusable UI Components:
  - `ErrorBoundary.tsx`: Class-based error boundary catching render errors with retry functionality
  - `LoadingSkeleton.tsx`: Shimmer-animated skeleton placeholders (text, card, avatar, button variants)
  - `EmptyState.tsx`: Contextual empty states (no-documents, no-results, no-chat-history, generic) with optional action buttons
- Dual-Mode Streaming: `ChatPage` now connects to `RAGOrchestrator` for browser-local inference (WebGPU) and `SSEStreamConsumer` for API server streaming, with seamless mode switching
- DocumentsPage Search Wiring: Document search now uses the full search pipeline (vector-index + keyword-index + RRF fusion)
- Service Initialization Hook (`useServiceInitialization.ts`): Sequential service initialization with proper cleanup on unmount; manages embedding service, vector index, and keyword index lifecycle
- Loading Overlay: Service initialization state surfaced via blocking overlay in `App.tsx` during startup
- Production Build Fixes: edgevec WASM snippet stub plugin for Vite; pdfjs worker initialization fix for production
- Enter Key Submission: Press Enter to submit questions (no need to click "Ask" button)
- Escape Key: Clears input field or cancels active operations
- Ctrl+Enter: Alternative shortcut for submitting questions
- Ctrl+L: Quick clear chat shortcut
- Ctrl+,: Open settings dialog shortcut
- Inline Typing Indicator: "Thinking..." indicator appears in chat area while processing (replaces status bar overwrite)
- Clear Chat Confirmation: Clear button requires a second click within 3 seconds to prevent accidental deletion
- Settings Switch Labels: CTkSwitch widgets now display descriptive text labels ("Enable Hybrid Search", "Enable Reranking")
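The two-click confirmation used for clearing the chat can be modelled as a small state machine. The sketch below is an illustration only: `ConfirmOnSecondClick` is a hypothetical name, and the injectable `clock` parameter is an assumption added to make the behaviour testable without waiting.

```python
import time


class ConfirmOnSecondClick:
    """Confirm an action only when clicked twice within a time window."""

    def __init__(self, window_seconds: float = 3.0, clock=time.monotonic):
        self._window = window_seconds
        self._clock = clock
        self._armed_at = None  # time of the pending first click, if any

    def click(self) -> bool:
        """Return True when this click confirms the action."""
        now = self._clock()
        if self._armed_at is not None and now - self._armed_at <= self._window:
            self._armed_at = None  # consume the confirmation
            return True
        # First click, or the previous one expired: arm (or re-arm) the button.
        self._armed_at = now
        return False
```

In a GUI callback, the first `click()` would switch the button label to a confirmation prompt and the second would perform the deletion. A click that arrives after the window has elapsed simply re-arms the button.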
- Windows 10 or later
- Python 3.10+
- pip package manager
1. Clone or download the repository, then change into it: `cd doc_qa_app`
2. Install dependencies: `pip install -r requirements.txt`
3. Download required models
   - GGUF Model (required for LLM inference): the default model, Gemma 4 E2B (Q5_K_M), is bundled. To use a custom model, download any GGUF-format model from Hugging Face: https://huggingface.co/models?search=gguf
   - Embedding Model (required for search): BAAI/bge-small-en-v1.5 is downloaded automatically on first use. It can be downloaded manually if needed for offline installation.
4. Run the application
   - GUI Mode (default): `python main.py`
   - CLI Mode: `python main.py --cli`
   - API Server: `python main.py --api --port 8080`
1. Download the offline installer bundle
   - Includes Python embeddable, wheels, and model files
2. Extract the bundle
   - Unzip to a directory on your machine
3. Install
   - Run the provided installer or execute `main.py`
4. No internet required after installation
| Variable | Description | Default |
|---|---|---|
| `RAG_DB_PATH` | Vector database location | ./doc_qa_db |
| `RAG_GGUF_PATH` | Path to GGUF model file | - |
| `RAG_CHUNK_SIZE` | Document chunk size (words) | 512 |
| `RAG_N_RESULTS` | Context chunks to retrieve | 3 |
| `RAG_MAX_TOKENS` | Max response tokens | 1024 |
| `RAG_TEMPERATURE` | LLM temperature | 0.3 |
| `API_PORT` | API server port | 8080 |
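As an illustration of how the variables in the table could be read with their documented defaults, here is a minimal sketch. The `RagSettings` dataclass and `load_settings` function are hypothetical names introduced for this example; the application's own configuration loader may be structured differently.

```python
import os
from dataclasses import dataclass
from typing import Mapping, Optional


@dataclass(frozen=True)
class RagSettings:
    """Defaults mirror the environment-variable table above."""

    db_path: str = "./doc_qa_db"
    gguf_path: Optional[str] = None  # None means: use the bundled model
    chunk_size: int = 512
    n_results: int = 3
    max_tokens: int = 1024
    temperature: float = 0.3
    api_port: int = 8080


def load_settings(env: Mapping[str, str] = os.environ) -> RagSettings:
    """Build settings from environment variables, falling back to defaults."""
    d = RagSettings()
    return RagSettings(
        db_path=env.get("RAG_DB_PATH", d.db_path),
        gguf_path=env.get("RAG_GGUF_PATH") or None,
        chunk_size=int(env.get("RAG_CHUNK_SIZE", d.chunk_size)),
        n_results=int(env.get("RAG_N_RESULTS", d.n_results)),
        max_tokens=int(env.get("RAG_MAX_TOKENS", d.max_tokens)),
        temperature=float(env.get("RAG_TEMPERATURE", d.temperature)),
        api_port=int(env.get("API_PORT", d.api_port)),
    )
```

Passing a plain dictionary instead of `os.environ` (for example `load_settings({"RAG_N_RESULTS": "2"})`) makes it easy to check how an override behaves without touching the real environment.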
Set both environment variables to enable authentication:
| Variable | Description | Example |
|---|---|---|
| `ENABLE_AUTH` | Enable authentication (any value enables) | true |
| `API_KEY` | Secret API key for authentication | your-secure-api-key |
Bash:

export ENABLE_AUTH=true
export API_KEY="your-secure-api-key"
python main.py --api --port 8080

PowerShell:

$env:ENABLE_AUTH=$true
$env:API_KEY="your-secure-api-key"
python main.py --api --port 8080

All API requests require authentication headers:
- API Key: `X-API-Key: <your-api-key>`
- JWT Bearer Token: `Authorization: Bearer <jwt-token>`
import requests
import os
# The server must already be running with ENABLE_AUTH and API_KEY set.
# Read the same key from the client's environment instead of hardcoding it.
headers = {
    "X-API-Key": os.environ["API_KEY"]
}
# Make authenticated request
response = requests.post("http://localhost:8080/ask", json={
"question": "What are the main findings?",
"n_results": 3
}, headers=headers)
print(response.json())

- Always use HTTPS in production
- Rotate API keys regularly
- Store API keys in environment variables, never in code
- See USAGE.md for complete authentication documentation
Backend Selection:
The application uses GGUF models only via llama-cpp-python.
If RAG_GGUF_PATH is set, that model is used. Otherwise, defaults to bundled Gemma 4.
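The selection rule above can be sketched as a small resolver. This is an illustration under assumptions: the function name `resolve_gguf_path` is hypothetical, the bundled filename is taken from the troubleshooting section, and giving an explicit `--gguf-path` precedence over the environment variable is an assumption rather than documented behaviour.

```python
import os
from typing import Mapping, Optional

# Filename of the bundled model as it appears in the troubleshooting section.
BUNDLED_MODEL = "gemma-4-E2B-it-Q5_K-M.gguf"


def resolve_gguf_path(
    cli_path: Optional[str] = None,
    env: Mapping[str, str] = os.environ,
) -> str:
    """Pick the GGUF model file to load."""
    # Assumption: an explicit CLI argument wins over the environment.
    if cli_path:
        return cli_path
    env_path = env.get("RAG_GGUF_PATH", "").strip()
    if env_path:
        return env_path
    # Neither was provided: fall back to the bundled default model.
    return BUNDLED_MODEL
```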
GUI Mode:
- Click "Ingest" button
- Select document folder (folder-based ingestion)
- Wait for processing to complete
Note: GUI supports folder-based batch ingestion. For single-file upload, use API or CLI mode.
CLI Mode:
# Ingest all documents in a directory
python main.py --ingest "C:\Documents\reports"
# Ingest a single file
python main.py --ingest "C:\Documents\report.pdf"

API Mode:
import requests
# Ingest entire directory
response = requests.post("http://localhost:8080/ingest", json={
"directory": "C:/Documents/reports"
})
print(response.json())
# Upload and ingest single file
with open("C:/Documents/report.pdf", "rb") as f:
response = requests.post(
"http://localhost:8080/ingest/file",
files={"file": ("report.pdf", f, "application/pdf")}
)
print(response.json())

GUI Mode:
- Type your question in the input field
- Press Enter or click "Ask"
- View the answer with source citations
CLI Mode:
# Single question
python main.py --query "What are the main findings?"
# Interactive mode
python main.py --cli

API Mode:
import requests
response = requests.post("http://localhost:8080/ask", json={
"question": "What are the main findings?",
"n_results": 3
})
print(response.json())

Combines BM25 keyword search with vector semantic search using RRF fusion:
- BM25: Fast keyword matching
- Vector: Semantic understanding
- RRF Fusion: Combines both for optimal results
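For readers unfamiliar with Reciprocal Rank Fusion, the sketch below shows the core idea: each result list contributes `1 / (k + rank)` to a chunk's score, so chunks ranked well by both retrievers rise to the top. The constant `k = 60` is the value commonly used in the literature and is an assumption here, as are the chunk IDs; the application's own fusion code may differ.

```python
from collections import defaultdict
from typing import Dict, List, Sequence


def reciprocal_rank_fusion(
    rankings: Sequence[Sequence[str]], k: int = 60
) -> List[str]:
    """Fuse several ranked lists of chunk IDs into one ranking."""
    scores: Dict[str, float] = defaultdict(float)
    for ranking in rankings:
        for rank, chunk_id in enumerate(ranking, start=1):
            scores[chunk_id] += 1.0 / (k + rank)
    # Highest fused score first; ties are broken by ID for determinism.
    return sorted(scores, key=lambda cid: (-scores[cid], cid))


bm25_hits = ["c2", "c7", "c1"]    # keyword ranking (made-up IDs)
vector_hits = ["c7", "c3", "c2"]  # semantic ranking (made-up IDs)
print(reciprocal_rank_fusion([bm25_hits, vector_hits]))
# ['c7', 'c2', 'c3', 'c1']
```

Because only ranks are used, the BM25 and cosine-similarity scores never need to be normalized against each other, which is the main practical appeal of RRF.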
Automatically fetches adjacent chunks around retrieved results:
- Configurable window size (default: 1 chunk)
- Ensures context continuity
- Improves answer quality for multi-part questions
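Window expansion can be pictured as widening each retrieved chunk index by a fixed number of neighbours and merging the overlaps. The following sketch assumes chunks are numbered consecutively within a single document; `expand_window` is a hypothetical helper, not the application's function.

```python
from typing import Iterable, List


def expand_window(
    hit_indices: Iterable[int], total_chunks: int, window: int = 1
) -> List[int]:
    """Return hit indices plus `window` neighbours on each side, deduplicated."""
    expanded = set()
    for i in hit_indices:
        lo = max(0, i - window)                 # clamp at the first chunk
        hi = min(total_chunks - 1, i + window)  # clamp at the last chunk
        expanded.update(range(lo, hi + 1))
    # Sorted so the context is assembled in reading order.
    return sorted(expanded)


print(expand_window([0, 4, 5], total_chunks=8))
# [0, 1, 3, 4, 5, 6]
```

Returning the indices in document order means adjacent chunks can be concatenated into a continuous passage before they are handed to the LLM.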
MS MARCO TinyBERT reranker (enabled by default):
- Ranks retrieved chunks by relevance after initial retrieval
- Higher accuracy than pure hybrid search
- Lightweight (~85MB) — optimized for minimum-spec hardware
- Can be disabled via Settings dialog
Keyword-based query expansion (disabled by default):
- Extracts key terms from questions to improve retrieval
- Note: The LLM-based step-back transformation is not wired (latency cost too high for minimum-spec hardware)
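A keyword-based expansion step of this kind can be approximated with a stopword filter. The sketch below is a simplified illustration: the stopword list, the three-character minimum, and the function names are all assumptions, and the real `query_transformer.py` may extract terms differently.

```python
import re
from typing import List

# Small illustrative stopword list (assumption, not the app's list).
STOPWORDS = frozenset(
    "a an and are as at be by for from how in is it of on or that the "
    "this to was what when where which who why with".split()
)


def extract_key_terms(question: str, max_terms: int = 5) -> List[str]:
    """Return up to max_terms distinct content words, in order of appearance."""
    terms: List[str] = []
    for word in re.findall(r"[a-z0-9]+", question.lower()):
        if len(word) > 2 and word not in STOPWORDS and word not in terms:
            terms.append(word)
    return terms[:max_terms]


def expand_query(question: str) -> str:
    """Append the extracted key terms to the original question."""
    terms = extract_key_terms(question)
    if not terms:
        return question
    return question + " " + " ".join(terms)


print(extract_key_terms("What are the main findings?"))
# ['main', 'findings']
```

Repeating the key terms gives them more weight in a keyword retriever such as BM25 without requiring an extra LLM call.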
LLM Settings:
- GGUF Model Path: Path to `.gguf` model file
RAG Settings:
- Chunk Size: Number of words per chunk
- Results to Retrieve: Number of chunks for context
- Max Tokens: Maximum response length
- Temperature: Response creativity (0.0-1.0)
Advanced Settings:
- Hybrid Search: Enable/disable BM25+Vector search
- Window Expansion: Number of adjacent chunks to fetch
- Cross-Encoder Reranking: Enable/disable reranking
python main.py [OPTIONS]
Options:
--api Run API server
--cli Run in interactive CLI mode
--ingest PATH Ingest documents from directory
--query QUESTION Ask a question
--db-path PATH Path to vector database (default: ./doc_qa_db)
--model-path PATH Path to GGUF model file (legacy alias for --gguf-path)
--gguf-path PATH GGUF model path
--port PORT API server port (default: 8080)
--chunk-size SIZE Chunk size in words (default: 512)
--chunk-overlap N Chunk overlap in words (default: 50)

┌─────────────────────────────────────────────────────────────┐
│                      Document Q&A App                       │
├─────────────────────────────────────────────────────────────┤
│                                                             │
│  ┌──────────────┐    ┌──────────────┐    ┌──────────────┐   │
│  │   Document   │    │ Vector Store │    │ LLM Interface│   │
│  │   Processor  │───▶│  (ChromaDB+  │    │ (GGUF-only)  │   │
│  │              │    │   BM25+RRF)  │◀───│              │   │
│  └──────────────┘    └──────────────┘    └──────────────┘   │
│         │                   │                   │           │
│         └───────────────────┴───────────────────┘           │
│                             │                               │
│                      ┌──────▼──────┐                        │
│                      │ RAG Engine  │                        │
│                      │   (Query    │                        │
│                      │ Processing) │                        │
│                      └──────┬──────┘                        │
│                             │                               │
│                      ┌──────▼──────┐                        │
│                      │  GUI / API  │                        │
│                      └─────────────┘                        │
│                                                             │
└─────────────────────────────────────────────────────────────┘
Document Processor
- Extracts text from PDF, DOCX, PPTX, TXT, MD
- Semantic chunking with paragraph/sentence boundaries
- Chunk overlap for context continuity
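The chunking behaviour described above can be illustrated with a greedy sentence packer. This is a simplified sketch, not the code in `document_processor.py`: it splits on sentence-ending punctuation only, treats paragraph breaks as ordinary whitespace, and lets a single sentence longer than `chunk_size` form an oversized chunk. The defaults mirror the documented `--chunk-size` and `--chunk-overlap` values.

```python
import re
from typing import List


def chunk_words(text: str, chunk_size: int = 512, overlap: int = 50) -> List[str]:
    """Pack whole sentences into chunks of roughly chunk_size words.

    The last `overlap` words of a chunk are repeated at the start of
    the next one so context carries across chunk boundaries.
    """
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must satisfy 0 <= overlap < chunk_size")
    # Split after ., ! or ? followed by whitespace (newlines included).
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]
    chunks: List[str] = []
    current: List[str] = []
    for sentence in sentences:
        words = sentence.split()
        if current and len(current) + len(words) > chunk_size:
            chunks.append(" ".join(current))
            # Carry the tail of the finished chunk into the next one.
            current = current[-overlap:] if overlap else []
        current.extend(words)
    if current:
        chunks.append(" ".join(current))
    return chunks


sample = "One two three. Four five six. Seven eight nine."
print(chunk_words(sample, chunk_size=6, overlap=2))
# ['One two three. Four five six.', 'five six. Seven eight nine.']
```

Breaking only at sentence boundaries keeps each chunk readable on its own, at the cost of chunk lengths that vary around the target size.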
Vector Store
- ChromaDB for semantic vector storage
- BM25Index for keyword-based search
- Reciprocal Rank Fusion (RRF) for hybrid results
- Window expansion for context fetching
LLM Interface
- GGUF via llama-cpp-python (CPU-only, fully offline)
RAG Engine
- Query processing and routing
- Hybrid search orchestration
- Context assembly and answer generation
- Source citation tracking
Solution 1: GGUF Model Not Found
# Check if model file exists (default bundled model)
dir gemma-4-E2B-it-Q5_K-M.gguf
# If not, download from:
# https://huggingface.co/google/gemma-4-2b-it-gguf

Solution 2: Wrong Model Path
- Check Settings dialog for correct path
- Use "Browse" button to select model file
pip install chromadb --break-system-packages

pip install sentence-transformers

# CPU-only build (recommended)
pip install llama-cpp-python
# With CUDA support (if you have NVIDIA GPU)
pip install llama-cpp-python --extra-index-url https://abetlen.github.io/llama-cpp-python/whl/cu121

- Embedding model (~80MB) downloads on first use
- Subsequent runs use cached model
- BM25 index is built on first ingestion
Solution 1: Reduce chunk size

python main.py --chunk-size 128

Solution 2: Increase chunk overlap

python main.py --chunk-size 256 --chunk-overlap 100

Solution 3: Reduce number of results

$env:RAG_N_RESULTS=2

Check BM25 is enabled:
# In API, check config
from rag_engine import create_engine_from_env
engine = create_engine_from_env()
print(engine.config.hybrid_search) # Should be True

Verify both backends loaded:
# Check vector store stats
stats = engine.vector_store.get_stats()
print(f"Embedding model: {stats['embedding_model']}")
print(f"BM25 index: {'Ready' if engine.vector_store.bm25_index else 'Not built'}")

| Endpoint | Method | Description |
|---|---|---|
| `/` | GET | Health check |
| `/stats` | GET | Engine statistics |
| `/ask` | POST | Ask a question (non-streaming) |
| `/ask/stream` | POST | Ask a question with SSE streaming |
| `/search` | POST | Search documents |
| `/ingest` | POST | Ingest directory |
| `/ingest/file` | POST | Upload and ingest single file |
| `/ingest/batch` | POST | Batch upload and ingest (up to 20 files) |
| `/documents` | GET | List documents |
| `/documents` | DELETE | Clear all documents |
| `/settings` | GET | Get current RAG settings |
| `/settings` | PUT | Update RAG settings |
| `/auth/status` | GET | Authentication status |
| `/auth/token` | POST | Obtain JWT token |
import os
import requests
# Configure the engine
os.environ["RAG_GGUF_PATH"] = "path/to/gemma-4-E2B-it-Q5_K-M.gguf"
# Start API server in another terminal
# python main.py --api --port 8080
# Ask a question
response = requests.post("http://localhost:8080/ask", json={
"question": "What are the main findings?",
"n_results": 3
})
result = response.json()
print(f"Answer: {result['answer']}")
print(f"Sources: {result['sources']}")
print(f"Inference time: {result['inference_time']:.2f}s")

import json
import requests
# Ask with streaming response
with requests.post(
    "http://localhost:8080/ask/stream",
    json={"question": "What are the main findings?", "n_results": 3},
    headers={"Authorization": "Bearer <token>"},
    stream=True
) as response:
    for raw_line in response.iter_lines():
        # iter_lines() yields bytes; decode before matching the SSE prefix
        line = raw_line.decode("utf-8")
        if line.startswith("data: "):
            data = json.loads(line[6:])
            if "token" in data:
                print(data["token"], end="", flush=True)
            elif data.get("done"):
                print(f"\n\nSources: {data['sources']}")
                print(f"Inference time: {data['inference_time']:.2f}s")

import requests
# Upload multiple files at once (up to 20)
files = [
("files", ("report1.pdf", open("report1.pdf", "rb"), "application/pdf")),
("files", ("report2.docx", open("report2.docx", "rb"), "application/vnd.openxmlformats-officedocument.wordprocessingml.document")),
("files", ("notes.txt", open("notes.txt", "rb"), "text/plain")),
]
response = requests.post(
"http://localhost:8080/ingest/batch",
files=files,
headers={"Authorization": "Bearer <token>"}
)
result = response.json()
print(f"Total: {result['total_files']}, Succeeded: {result['successful']}, Failed: {result['failed']}")
for r in result["results"]:
    if r["success"]:
        print(f"  ✓ {r['filename']}: {r.get('chunks_added', 0)} chunks")
    else:
        # Failed uploads report an error message instead of a chunk count
        print(f"  ✗ {r['filename']}: {r.get('error', 'unknown error')}")

import requests
# Get current settings
response = requests.get(
"http://localhost:8080/settings",
headers={"Authorization": "Bearer <token>"}
)
settings = response.json()
print(f"Chunk size: {settings['chunk_size']}, Overlap: {settings['chunk_overlap']}")
# Update settings (partial update supported)
response = requests.put(
"http://localhost:8080/settings",
json={"rag_temperature": 0.7, "rag_chunk_size": 768},
headers={"Authorization": "Bearer <token>"}
)
updated = response.json()
print(f"New temperature: {updated['temperature']}, chunk size: {updated['chunk_size']}")

pip install pyinstaller

python build.py

The executable will be created in `dist/DocumentQA.exe`.
To create an offline installer:
# Prepare installer files
python scripts/build_installer.py
# Manually download:
# 1. GGUF model to build_installer/models/
# 2. Embedding model to build_installer/embeddings/
# 3. Python embeddable to python_embeddable/
# Run Inno Setup
iscc build_installer/setup.iss

This creates an offline installer with all dependencies and models included.
A new browser-based interface is being developed alongside the existing desktop GUI.
- Vite 6 + React 18 + TypeScript 5
- Pure CSS design token system (no Tailwind)
- vitest + @testing-library/react for testing
Translates Python theme.py (ColorTokens, TypeScale, Spacing) to CSS custom properties:
| Token Category | Examples |
|---|---|
| Colors | --color-primary, --color-bubble-user, --color-text-muted |
| Typography | --font-family, --font-size-h1 (20px) through --font-size-small (10px) |
| Spacing | --spacing-xs (2px) through --spacing-section (32px) on 4px grid |
Dark mode overrides via [data-theme="dark"] attribute on <html>.
cd web_ui
npm install
npm run dev # Development server
npm run build # Production build
npm run typecheck # TypeScript validation
npm test # Run tests with vitest

The web UI includes a typed API client (`src/lib/api/`) for all backend endpoints:

| File | Description |
|---|---|
| `client.ts` | ApiClient class with methods for all endpoints |
| `streaming.ts` | SSEStreamConsumer for POST-based SSE streaming |
| `auth.ts` | Token storage with Safari private mode fallback |
| `types.ts` | TypeScript interfaces matching FastAPI models |
| `index.ts` | Barrel export and default client instance |
The web UI includes browser-side document processing with no server uploads:
import { ExtractorFactory } from './lib/processing/extractor-factory';
import { TextChunker } from './lib/processing/text-chunker';
import { DocumentStore } from './lib/storage/document-store';
// Extract text from uploaded file
const extractor = ExtractorFactory.getExtractor(file);
const extraction = await extractor.extract(file);
// Chunk with semantic boundaries
const chunker = new TextChunker({ chunkSize: 512, overlap: 50 });
const chunks = chunker.chunk(extraction.text, extraction.metadata);
// Store in IndexedDB
const store = new DocumentStore();
await store.saveDocument({
id: crypto.randomUUID(),
name: file.name,
type: file.type,
size: file.size,
chunks,
createdAt: new Date()
});
// List all documents
const docs = await store.loadDocuments();
console.log(`Loaded ${docs.length} documents`);

| File | Description |
|---|---|
| `src/types/chat.ts` | Shared ChatMessage, MessageRole, and ChatState types |
| `src/lib/streaming/TokenStreamManager.ts` | RAF-batched token delivery, unified callbacks for SSE/WebLLM, cancellation support |
| `src/lib/inference/InferenceModeContext.tsx` | React context for browser-local/api mode with localStorage persistence |
Usage:
import { apiClient, SSEStreamConsumer, login } from './lib/api';
// Ask a question
const answer = await apiClient.ask("What are the main findings?");
// Stream tokens with SSE
const stream = new SSEStreamConsumer('/ask/stream', { question: "Tell me more" });
stream.onToken(token => appendToAnswer(token));
stream.onDone(data => showSources(data.sources));
stream.start();
// Batch upload
const batch = await apiClient.uploadBatch([file1, file2, file3]);
console.log(`Uploaded ${batch.successful}/${batch.total_files} files`);
// Settings
const settings = await apiClient.getSettings();
await apiClient.updateSettings({ rag_temperature: 0.8 });

The ML spike page validates Transformers.js, EdgeVec, and FlexSearch on target hardware.
Test Categories:
- Transformers.js: Hugging Face transformers running in browser (feature-extraction pipeline)
- EdgeVec: HNSW-based vector similarity search (edgevec npm package)
- FlexSearch: Full-text search indexing (flexsearch npm package)
Results show pass/fail/skip status, duration, and memory delta for each library.
The web UI implements a complete browser-side search pipeline:
Query → Embeddings (Transformers.js) → HNSW (EdgeVec) → RRF Fusion → Reranker (optional)
↓
Keyword Index (FlexSearch) ──────────────────────────────→
| Component | File | Description |
|---|---|---|
| Embedding Service | `src/lib/embeddings/embedding-service.ts` | Transformers.js pipeline with bge-small-en-v1.5 ONNX model, OPFS caching |
| Memory-Aware Selection | `src/lib/embeddings/memory-aware.ts` | Device memory detection, tier-based model configuration |
| Vector Index | `src/lib/search/vector-index.ts` | EdgeVec HNSW index with IndexedDB persistence |
| Keyword Index | `src/lib/search/keyword-index.ts` | FlexSearch with resolution-based scoring |
| RRF Fusion | `src/lib/search/rrf-fusion.ts` | Reciprocal Rank Fusion for hybrid results |
| Reranker | `src/lib/search/reranker.ts` | Cross-encoder reranker (ms-marco-MiniLM-L-6-v2) |
| Types | `src/types/embedding.ts` | EmbeddingDocument, EmbeddingResult interfaces |
| Types | `src/types/search.ts` | SearchResult, HybridSearchResult interfaces |
Dependencies added: @huggingface/transformers ^3.0.0, edgevec ^0.6.0, flexsearch ^0.8.0
doc_qa_app/
├── main.py # Main entry point
├── app_gui.py # GUI application (customtkinter)
├── api_server.py # FastAPI REST server
├── rag_engine.py # RAG orchestration
├── document_processor.py # Document extraction & semantic chunking
├── vector_store.py # Vector search (ChromaDB + BM25 + RRF)
├── llm_interface.py # LLM interface (GGUF-only)
├── reranking.py # Cross-encoder reranking
├── query_transformer.py # Query transformation
├── utils.py # Utility functions (RRF fusion)
├── requirements.txt # Python dependencies
├── build.py # PyInstaller build script
├── scripts/
│ └── build_installer.py # Inno Setup preparation
├── web_ui/ # HTML5 Web UI (Phase 3+)
│ ├── src/
│ │ ├── pages/
│ │ │ ├── ChatPage.tsx # Chat UI page (Phase 3)
│ │ │ ├── DocumentsPage.tsx # Document upload & management (Phase 4)
│ │ │ └── SettingsPage.tsx # Settings page with 6 sections (Phase 7)
│ │ ├── components/
│ │ │ ├── ChatMessageBubble.tsx # Role-based message bubbles (Phase 3)
│ │ │ ├── ChatMessageList.tsx # Scrollable message container (Phase 3)
│ │ │ ├── ChatInput.tsx # Input with send/cancel (Phase 3)
│ │ │ ├── MarkdownRenderer.tsx # Zero-dependency markdown (Phase 3)
│ │ │ ├── SourceCitation.tsx # Expandable citation pills (Phase 3)
│ │ │ ├── InferenceModeToggle.tsx # Mode status toggle (Phase 3)
│ │ │ ├── StreamingIndicator.tsx # Bouncing dots animation (Phase 3)
│ │ │ ├── DropZone.tsx # Drag-and-drop file upload (Phase 4)
│ │ │ ├── DocumentList.tsx # Document list with status (Phase 4)
│ │ │ ├── ModelDownloadProgress.tsx # Download progress UI (Phase 6)
│ │ │ ├── ErrorBoundary.tsx # Error boundary with retry (Phase 7)
│ │ │ ├── LoadingSkeleton.tsx # Skeleton loading placeholders (Phase 7)
│ │ │ └── EmptyState.tsx # Empty state messages (Phase 7)
│ │ ├── lib/
│ │ │ ├── streaming/
│ │ │ │ └── TokenStreamManager.ts # RAF-batched token delivery (Phase 3)
│ │ │ ├── inference/
│ │ │ │ └── InferenceModeContext.tsx # Browser-local/API mode context (Phase 3)
│ │ │ ├── browser/
│ │ │ │ └── browser-compat.ts # Cross-browser WebGPU detection (Phase 7)
│ │ │ ├── embeddings/
│ │ │ │ ├── embedding-service.ts # Transformers.js embedding (Phase 5)
│ │ │ │ └── memory-aware.ts # Memory-aware model selection (Phase 5)
│ │ │ ├── llm/
│ │ │ │ ├── web-llm-service.ts # WebLLM browser inference (Phase 6)
│ │ │ │ ├── model-download.ts # Download manager with ETA (Phase 6)
│ │ │ │ ├── model-readiness.ts # WebGPU/memory readiness gate (Phase 6)
│ │ │ │ └── webgpu-watchdog.ts # Context loss recovery (Phase 6)
│ │ │ ├── rag/
│ │ │ │ └── rag-orchestrator.ts # RAG pipeline orchestrator (Phase 6)
│ │ │ ├── search/
│ │ │ │ ├── vector-index.ts # EdgeVec HNSW index (Phase 5)
│ │ │ │ ├── keyword-index.ts # FlexSearch keyword index (Phase 5)
│ │ │ │ ├── rrf-fusion.ts # Reciprocal Rank Fusion (Phase 5)
│ │ │ │ └── reranker.ts # Cross-encoder reranker (Phase 5)
│ │ │ ├── processing/
│ │ │ │ ├── pdf-extractor.ts # PDF text extraction (Phase 4)
│ │ │ │ ├── docx-extractor.ts # DOCX text extraction (Phase 4)
│ │ │ │ ├── xlsx-extractor.ts # XLSX text extraction (Phase 4)
│ │ │ │ ├── pptx-extractor.ts # PPTX text extraction (Phase 4)
│ │ │ │ ├── txt-extractor.ts # TXT/MD text extraction (Phase 4)
│ │ │ │ ├── extractor-factory.ts # MIME-type based extractor selection (Phase 4)
│ │ │ │ └── text-chunker.ts # Semantic chunking with overlap (Phase 4)
│ │ │ └── storage/
│ │ │ └── document-store.ts # IndexedDB document storage (Phase 4)
│ │ ├── types/
│ │ │ ├── chat.ts # Shared chat types (Phase 3)
│ │ │ ├── document.ts # Document types (Phase 4)
│ │ │ ├── embedding.ts # Embedding types (Phase 5)
│ │ │ ├── search.ts # Search result types (Phase 5)
│ │ │ └── llm.ts # LLM types, WebGPU-only inference mode (Phase 6)
│ │ └── styles/
│ │ └── tokens.css # Design tokens + @keyframes blink (Phase 3)
│ ├── package.json
│ └── ...
└── README.md # This file
- Offline-Only: No data leaves your machine
- No Cloud Services: All processing is local
- Model Bundling: Models are stored locally
- Portable: Can be run from USB drive
MIT License - See LICENSE for details.
- Fork the repository
- Create a feature branch
- Make your changes
- Add tests if applicable
- Submit a pull request
- ChromaDB - Vector database
- Sentence Transformers - Embedding models
- llama-cpp-python - GGUF inference
- PyMuPDF - PDF processing
- CustomTkinter - Modern GUI toolkit
- pdfjs-dist - PDF processing (Apache-2.0)
- @huggingface/transformers - In-browser ML models (Apache-2.0)
- @mlc-ai/web-llm - In-browser LLM inference (Apache-2.0)
- edgevec - In-browser vector database (MIT OR Apache-2.0)
- flexsearch - Full-text search (Apache-2.0)
- mammoth - DOCX processing (BSD-2-Clause)
- xlsx - XLSX processing (Apache-2.0)
- jszip - ZIP handling (MIT OR GPL-3.0-or-later)
Version: 2.3.0

Last Updated: 2026-06-20 (Phase 9 → v2.3.0 web overhaul)

Hardware: CPU-only optimized for Intel 11th gen i5 and above (16GB RAM minimum)