A production-ready full-text search engine built from scratch on core data-structure and algorithm (DSA) primitives. It indexes and ranks 50 000+ documents with sub-50 ms query latency, backed by a TF-IDF/BM25 scoring pipeline, PostgreSQL persistence, and a REST API.
- Features
- Architecture
- Project Structure
- Getting Started
- REST API Reference
- Data Ingestion
- Query Syntax
- Configuration
- Testing
- Tech Stack
| Feature | Supported |
|---|---|
| Inverted index with per-document term positions, phrase search, and serialize/deserialize for persistence | Yes |
| Trie with frequency-ranked prefix autocomplete and per-node document-ID sets | Yes |
| Min-Heap with custom comparator for efficient top-K result extraction | Yes |
| TF-IDF & BM25 scoring with field boosts (title > heading > body) and proximity bonuses | Yes |
| Rich query syntax — boolean (AND/OR/NOT), phrase ("..."), wildcard (*), fuzzy (~), and field-scoped (title:...) queries | Yes |
| TTL-based result cache to serve repeated queries in microseconds | Yes |
| PostgreSQL persistence — documents, inverted-index snapshots, term-document entries, and search analytics logs | Yes |
| REST API secured with Helmet, rate limiting, and response compression | Yes |
| 158 tests across 9 suites (unit + integration) with all DB dependencies mocked | Yes |
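To make the first feature concrete: an inverted index maps each term to a postings list of (docId, positions), and phrase search intersects those lists looking for consecutive positions. The sketch below is a hypothetical miniature, not the project's actual `InvertedIndex` class, which also handles serialization and richer tokenization:

```javascript
// Minimal inverted index: term -> Map(docId -> [positions]).
// Illustrative sketch only; see src/data-structures/InvertedIndex.js for the real thing.
class TinyIndex {
  constructor() { this.postings = new Map(); }

  add(docId, text) {
    text.toLowerCase().split(/\W+/).filter(Boolean).forEach((term, pos) => {
      if (!this.postings.has(term)) this.postings.set(term, new Map());
      const docs = this.postings.get(term);
      if (!docs.has(docId)) docs.set(docId, []);
      docs.get(docId).push(pos); // record token position for phrase queries
    });
  }

  // Phrase search: docs where the terms occur at consecutive positions.
  phrase(terms) {
    const lists = terms.map(t => this.postings.get(t) || new Map());
    const hits = [];
    for (const [docId, firstPositions] of lists[0]) {
      const ok = firstPositions.some(p =>
        lists.every((docs, i) => (docs.get(docId) || []).includes(p + i))
      );
      if (ok) hits.push(docId);
    }
    return hits;
  }
}
```

For example, after indexing `'the event loop is fast'` as doc `a` and `'loop the event'` as doc `b`, `phrase(['event', 'loop'])` matches only doc `a`, where the two terms are adjacent.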
```
┌─────────────────────────────────────────────────────────┐
│            REST API (Express)                           │
│  /api/search   /api/autocomplete   /api/documents       │
└────────────────────────┬────────────────────────────────┘
                         │
              ┌──────────▼──────────┐
              │    SearchEngine     │ ← TTL cache, pagination
              └──┬──────────────────┘
                 │
      ┌──────────┼──────────────────────┐
      │          │                      │
 ┌────▼────┐ ┌───▼────┐  ┌─────────────▼────────┐
 │ Ranker  │ │  Trie  │  │     QueryParser      │
 │ TF-IDF  │ │ (auto- │  │   boolean/phrase/    │
 │  BM25   │ │complete│  │ wildcard/fuzzy/field │
 └────┬────┘ └───┬────┘  └──────────────────────┘
      │          │
 ┌────▼──────────▼──────────────────┐
 │             Indexer              │
 │ DocumentProcessor → InvertedIndex│
 └────────────────┬─────────────────┘
                  │
      ┌───────────┼───────────┐
      │                       │
┌────▼────────┐   ┌──────────▼────────┐
│ DocumentRepo│   │  IndexRepository  │
│ (PostgreSQL)│   │ (snapshots/logs)  │
└─────────────┘   └───────────────────┘
```
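The TTL cache noted next to `SearchEngine` above can be pictured as a map from query string to `{ value, expires }`, with stale entries evicted on read. A minimal sketch of that pattern (hypothetical; the real implementation also enforces a `CACHE_MAX_SIZE` cap):

```javascript
// Tiny TTL cache: each entry carries an expiry timestamp, checked on get().
class TTLCache {
  constructor(ttlSeconds) {
    this.ttlMs = ttlSeconds * 1000;
    this.store = new Map();
  }
  set(key, value) {
    this.store.set(key, { value, expires: Date.now() + this.ttlMs });
  }
  get(key) {
    const entry = this.store.get(key);
    if (!entry) return undefined;
    if (Date.now() > entry.expires) { // stale: evict and report a miss
      this.store.delete(key);
      return undefined;
    }
    return entry.value;
  }
}
```

With `CACHE_TTL = 300`, a repeated query within five minutes is served from this map without touching the ranker, which is why cached queries return in microseconds.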
```
mini-search-engine/
├── src/
│   ├── api/
│   │   ├── middleware/          # requestLogger, rateLimiter, errorHandler
│   │   ├── routes/
│   │   │   ├── search.js        # GET /api/search, /autocomplete, /suggest, /stats
│   │   │   └── documents.js     # POST/GET/PUT/DELETE /api/documents
│   │   └── server.js
│   ├── config/
│   │   ├── constants.js         # Tuning parameters (BM25 k1/b, boosts, cache TTL…)
│   │   └── database.js          # pg connection pool
│   ├── data-structures/
│   │   ├── InvertedIndex.js     # Postings lists, phrase search, serialization
│   │   ├── MinHeap.js           # Generic heap + static topK()
│   │   ├── Trie.js              # Prefix autocomplete, frequency tracking
│   │   └── TrieNode.js
│   ├── indexing/
│   │   ├── DocumentProcessor.js # Tokenize → normalize → stop-word → stem → n-gram
│   │   ├── Indexer.js           # Index / remove / reindex documents, persistence hooks
│   │   └── TFIDFCalculator.js   # log-TF, IDF, TF-IDF, BM25
│   ├── search/
│   │   ├── QueryParser.js       # Parse boolean, phrase, wildcard, fuzzy, field queries
│   │   ├── Ranker.js            # Score + rank candidates via MinHeap
│   │   └── SearchEngine.js      # Unified entry point, cache, pagination
│   ├── storage/
│   │   ├── DocumentRepository.js
│   │   ├── IndexRepository.js
│   │   └── migrations/
│   │       ├── migrate.js
│   │       └── schema.sql
│   └── utils/
│       ├── helpers.js
│       └── logger.js            # Winston with file + console transports
├── scripts/
│   └── ingestWikipedia.js       # Batch ingestion (1 000 docs/batch) with retry logic
├── tests/
│   ├── unit/                    # InvertedIndex, Trie, MinHeap, DocumentProcessor,
│   │   └── ...                  # TFIDFCalculator, Ranker, QueryParser
│   └── integration/             # SearchEngine end-to-end, API routes (DB mocked)
├── .env.example
└── package.json
```
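The `DocumentProcessor` pipeline noted in the tree above (tokenize → normalize → stop-word → stem → n-gram) can be sketched roughly as follows. The tiny stop-word set and the crude suffix stripper here are stand-ins; the project itself uses the `natural` Porter stemmer and the `stopword` package:

```javascript
const STOP_WORDS = new Set(['the', 'is', 'a', 'of', 'and']); // stand-in list

// Crude suffix stripper standing in for the Porter stemmer.
const stem = (t) => t.replace(/(ing|ed|s)$/, '');

function processText(text, ngramSizes = [2]) {
  const tokens = text
    .toLowerCase()                     // normalize
    .split(/[^a-z0-9]+/)               // tokenize
    .filter(Boolean)
    .filter(t => !STOP_WORDS.has(t))   // stop-word removal
    .map(stem);                        // stem

  // Generate word n-grams alongside the unigrams.
  const ngrams = [];
  for (const n of ngramSizes) {
    for (let i = 0; i + n <= tokens.length; i++) {
      ngrams.push(tokens.slice(i, i + n).join(' '));
    }
  }
  return { tokens, ngrams };
}
```

For instance, `processText('The event loop of Node')` yields tokens `['event', 'loop', 'node']` and bigrams `['event loop', 'loop node']`; the bigrams are what make indexed phrase lookups cheap.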
| Requirement | Version |
|---|---|
| Node.js | ≥ 18 |
| npm | ≥ 9 |
| PostgreSQL | ≥ 14 |
```bash
git clone https://github.com/ayomidelog/Mini-Beta.git
cd Mini-Beta
npm install
```

Copy the example file and fill in your values:

```bash
cp .env.example .env
```

| Variable | Default | Description |
|---|---|---|
| `PORT` | `3000` | HTTP port |
| `NODE_ENV` | `development` | `development` \| `test` \| `production` |
| `DB_HOST` | `localhost` | PostgreSQL host |
| `DB_PORT` | `5432` | PostgreSQL port |
| `DB_NAME` | `search_engine` | Database name |
| `DB_USER` | `postgres` | Database user |
| `DB_PASSWORD` | `password` | Database password |
| `DB_POOL_MAX` | `10` | Max connection pool size |
| `MAX_RESULTS` | `10` | Default page size |
| `MIN_SCORE_THRESHOLD` | `0.01` | Minimum relevance score to include |
| `CACHE_TTL` | `300` | Result cache TTL in seconds |
| `AUTOCOMPLETE_LIMIT` | `10` | Max autocomplete suggestions |
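For reference, a `.env` populated with the documented defaults would look like this (these are the defaults from the table, not production credentials):

```shell
PORT=3000
NODE_ENV=development
DB_HOST=localhost
DB_PORT=5432
DB_NAME=search_engine
DB_USER=postgres
DB_PASSWORD=password
DB_POOL_MAX=10
MAX_RESULTS=10
MIN_SCORE_THRESHOLD=0.01
CACHE_TTL=300
AUTOCOMPLETE_LIMIT=10
```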
```bash
# Create the database (adjust credentials as needed)
createdb search_engine

# Run migrations
npm run migrate
```

```bash
# Production
npm start

# Development (auto-reload)
npm run dev
```

The API will be available at `http://localhost:3000`.
GET /api/search?q=<query>&page=1&limit=10
Query parameters
| Parameter | Type | Default | Description |
|---|---|---|---|
| `q` | string | — | Search query (required) |
| `page` | integer | `1` | Result page number |
| `limit` | integer | `10` | Results per page (max 100) |
Example request
```bash
curl "http://localhost:3000/api/search?q=%22event+loop%22+AND+performance&page=1&limit=5"
```

Example response
```json
{
  "results": [
    { "docId": "42", "score": 4.87 },
    { "docId": "17", "score": 3.12 }
  ],
  "totalHits": 2,
  "took": 11,
  "page": 1,
  "pageSize": 5
}
```

GET /api/autocomplete?q=<prefix>&limit=5
Returns prefix-matched terms sorted by index frequency.
Example request
```bash
curl "http://localhost:3000/api/autocomplete?q=perf&limit=5"
```

Example response
```json
{
  "suggestions": [
    { "word": "performance", "frequency": 42 },
    { "word": "perform", "frequency": 18 }
  ]
}
```

GET /api/suggest?q=<query>
Returns alternative query terms derived from indexed vocabulary (useful for "did you mean?" features).
Example request
```bash
curl "http://localhost:3000/api/suggest?q=javascrpt"
```

Example response

```json
{ "suggestions": ["javascript", "javascripts"] }
```

POST /api/documents
Content-Type: application/json
```json
{
  "id": "doc-1",
  "title": "Node.js internals",
  "heading": "Event Loop",
  "body": "The event loop is the heart of Node.js...",
  "url": "https://example.com/nodejs-internals"
}
```

Response `201 Created`

```json
{ "docId": "doc-1", "termCount": 312, "success": true }
```

POST /api/documents/batch
Content-Type: application/json
```json
{
  "documents": [
    { "id": "1", "title": "...", "body": "..." },
    { "id": "2", "title": "...", "body": "..." }
  ]
}
```

GET /api/documents/:id
PUT /api/documents/:id
Content-Type: application/json
```json
{ "title": "Updated title", "body": "Updated body text..." }
```

DELETE /api/documents/:id
`GET /health` → `{ "status": "ok", "uptime": 123.4 }`

`GET /api/search/stats` → `{ "stats": { "termCount": 182340, "documentCount": 50000, "trieSize": 75000, "cacheSize": 42 } }`
The included ingestion script reads a newline-delimited JSON file (each line is `{ "id", "title", "body", "url" }`) and indexes documents in batches of 1 000 with exponential-backoff retry on failure.

```bash
node scripts/ingestWikipedia.js --file /path/to/articles.ndjson
```

Or point it at a remote URL:
```bash
node scripts/ingestWikipedia.js --url https://example.com/articles.ndjson
```

| Syntax | Example | Behaviour |
|---|---|---|
| Simple terms | `node performance` | Scores documents containing any term |
| Boolean AND | `node AND performance` | Documents must contain both terms |
| Boolean OR | `node OR deno` | Documents containing either term |
| Boolean NOT | `javascript NOT typescript` | Excludes documents containing the NOT term |
| Phrase | `"event loop"` | Documents where terms appear consecutively |
| Mixed | `"event loop" AND performance` | Phrase constraint + boolean filter |
| Wildcard | `perform*` | Prefix match on `perform` |
| Fuzzy | `performanc~` | Returns closest prefix matches via Trie |
| Field-scoped | `title:nodejs` | Searches only the specified field |
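Because quotes, spaces, and other operator characters are not URL-safe, queries sent to `/api/search` need to be URL-encoded. A sketch of building a request URL for a mixed phrase + boolean query (the base URL is the local default from Getting Started):

```javascript
// Build a search URL for the mixed query: "event loop" AND performance.
const base = 'http://localhost:3000/api/search';
const query = '"event loop" AND performance';

// URLSearchParams encodes quotes as %22 and spaces as +.
const url = `${base}?${new URLSearchParams({ q: query, page: '1', limit: '5' })}`;
console.log(url);
```

This produces the same encoded query string used in the curl example above.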
All tuning parameters live in src/config/constants.js:
| Constant | Default | Description |
|---|---|---|
| `BM25_K1` | `1.2` | BM25 term-frequency saturation parameter |
| `BM25_B` | `0.75` | BM25 document-length normalization |
| `TITLE_BOOST` | `3.0` | Score multiplier for title field matches |
| `HEADING_BOOST` | `2.0` | Score multiplier for heading field matches |
| `BODY_BOOST` | `1.0` | Score multiplier for body field matches |
| `CACHE_TTL` | `300` s | Time-to-live for cached search results |
| `CACHE_MAX_SIZE` | `1000` | Maximum number of cached queries |
| `NGRAM_SIZES` | `[2]` | N-gram sizes generated during indexing |
| `MAX_POSITIONS_STORED` | `100` | Max token positions stored per term per doc |
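With the defaults above, a term's BM25 contribution is IDF(t) · tf·(k1+1) / (tf + k1·(1 − b + b·dl/avgdl)), multiplied by the field boost. The sketch below is standalone arithmetic to show how the knobs interact, not the project's `TFIDFCalculator`:

```javascript
// BM25 term score for one document, with an optional field boost.
// k1 controls term-frequency saturation; b controls length normalization.
function bm25(tf, df, N, docLen, avgDocLen, { k1 = 1.2, b = 0.75, boost = 1.0 } = {}) {
  const idf = Math.log(1 + (N - df + 0.5) / (df + 0.5)); // smoothed IDF, never negative
  const norm = 1 - b + b * (docLen / avgDocLen);          // length normalization
  return boost * idf * (tf * (k1 + 1)) / (tf + k1 * norm);
}
```

Two properties worth noting: a title match with `TITLE_BOOST = 3.0` scores exactly three times the equivalent body match, and doubling a term's frequency less than doubles its score, which is the saturation that `k1` controls.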
```bash
# Run all tests (158 tests, 9 suites)
npm test

# Unit tests only
npm run test:unit

# Integration tests only
npm run test:integration
```

Tests mock all PostgreSQL dependencies, so no live database is required.
| Layer | Technology |
|---|---|
| Runtime | Node.js 18+ |
| Web framework | Express 4 |
| NLP / Stemming | natural (Porter stemmer), stopword |
| Database | PostgreSQL 14+ with pg pool |
| Logging | Winston (console + rotating file transports) |
| Security | Helmet, express-rate-limit |
| Compression | compression |
| Testing | Jest 29, Supertest |
| Dev tooling | Nodemon, dotenv |