ayomidelog/MiniSearchEngine
Mini Search Engine

A production-ready, full-text search engine built from scratch on core data-structure and algorithm (DSA) primitives. It indexes and ranks 50,000+ documents with sub-50 ms query latency, backed by a TF-IDF/BM25 scoring pipeline, PostgreSQL persistence, and a REST API.




Features

| Feature | Supported |
| --- | --- |
| Inverted index with per-document term positions, phrase search, and serialize/deserialize for persistence | Yes |
| Trie with frequency-ranked prefix autocomplete and per-node document-ID sets | Yes |
| Min-Heap with custom comparator for efficient top-K result extraction | Yes |
| TF-IDF & BM25 scoring with field boosts (title > heading > body) and proximity bonuses | Yes |
| Rich query syntax: boolean (`AND`/`OR`/`NOT`), phrase (`"..."`), wildcard (`*`), fuzzy (`~`), and field-scoped (`title:...`) queries | Yes |
| TTL-based result cache to serve repeated queries in microseconds | Yes |
| PostgreSQL persistence: documents, inverted-index snapshots, term-document entries, and search analytics logs | Yes |
| REST API secured with Helmet, rate limiting, and response compression | Yes |
| 158 tests across 9 suites (unit + integration) with all DB dependencies mocked | Yes |
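To make the first feature above concrete, here is a minimal sketch of a positional inverted index with phrase search. The names (`TinyInvertedIndex`, `addDocument`, `phraseSearch`) are illustrative, not the repo's actual API:

```javascript
// Hypothetical sketch: positional postings lists with phrase search.
class TinyInvertedIndex {
  constructor() {
    this.postings = new Map(); // term -> Map(docId -> [positions])
  }

  addDocument(docId, text) {
    const tokens = text.toLowerCase().split(/\W+/).filter(Boolean);
    tokens.forEach((term, pos) => {
      if (!this.postings.has(term)) this.postings.set(term, new Map());
      const docs = this.postings.get(term);
      if (!docs.has(docId)) docs.set(docId, []);
      docs.get(docId).push(pos);
    });
  }

  // A phrase matches when each term appears one position after the previous.
  phraseSearch(phrase) {
    const terms = phrase.toLowerCase().split(/\W+/).filter(Boolean);
    const first = this.postings.get(terms[0]);
    if (!first) return [];
    const hits = [];
    for (const [docId, positions] of first) {
      const ok = positions.some((p) =>
        terms.every((t, i) => (this.postings.get(t)?.get(docId) || []).includes(p + i))
      );
      if (ok) hits.push(docId);
    }
    return hits;
  }
}

const idx = new TinyInvertedIndex();
idx.addDocument('doc-1', 'The event loop is the heart of Node.js');
idx.addDocument('doc-2', 'loop constructs and event handlers');
idx.phraseSearch('event loop'); // matches only doc-1, where the terms are adjacent
```

Storing positions per term per document is what makes phrase queries possible without rescanning the original text.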

Architecture

```text
┌─────────────────────────────────────────────────────────┐
│                     REST API (Express)                   │
│   /api/search   /api/autocomplete   /api/documents       │
└────────────────────────┬────────────────────────────────┘
                         │
              ┌──────────▼──────────┐
              │     SearchEngine    │  ← TTL cache, pagination
              └──┬──────────────────┘
                 │
      ┌──────────┼──────────────────────┐
      │          │                      │
 ┌────▼────┐ ┌───▼────┐  ┌─────────────▼──────┐
 │  Ranker │ │  Trie  │  │  QueryParser        │
 │ TF-IDF  │ │ (auto- │  │ boolean/phrase/      │
 │ BM25    │ │complete│  │ wildcard/fuzzy/field │
 └────┬────┘ └───┬────┘  └────────────────────┘
      │          │
 ┌────▼──────────▼──────────────────┐
 │          Indexer                  │
 │  DocumentProcessor → InvertedIndex│
 └────────────────┬─────────────────┘
                  │
      ┌───────────┼───────────┐
      │                       │
 ┌────▼────────┐   ┌──────────▼────────┐
 │ DocumentRepo│   │  IndexRepository  │
 │ (PostgreSQL)│   │  (snapshots/logs) │
 └─────────────┘   └───────────────────┘
```
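The Ranker box above relies on the Min-Heap for top-K extraction: keep at most K candidates, evicting the lowest score on overflow. A sketch under the assumption of a score accessor; `topK` here is illustrative, though the features list says the repo's `MinHeap` exposes a comparable static `topK()`:

```javascript
// Keep the K highest-scoring items using a size-K min-heap.
function topK(items, k, score) {
  const heap = []; // array-backed min-heap ordered by score
  const siftUp = (i) => {
    while (i > 0) {
      const p = (i - 1) >> 1;
      if (score(heap[p]) <= score(heap[i])) break;
      [heap[p], heap[i]] = [heap[i], heap[p]];
      i = p;
    }
  };
  const siftDown = (i) => {
    for (;;) {
      let m = i;
      for (const c of [2 * i + 1, 2 * i + 2]) {
        if (c < heap.length && score(heap[c]) < score(heap[m])) m = c;
      }
      if (m === i) break;
      [heap[m], heap[i]] = [heap[i], heap[m]];
      i = m;
    }
  };
  for (const item of items) {
    if (heap.length < k) {
      heap.push(item);
      siftUp(heap.length - 1);
    } else if (score(item) > score(heap[0])) {
      heap[0] = item; // replace the current minimum
      siftDown(0);
    }
  }
  return heap.sort((a, b) => score(b) - score(a)); // best first
}

topK([{ id: 1, s: 0.2 }, { id: 2, s: 4.8 }, { id: 3, s: 3.1 }, { id: 4, s: 1.0 }], 2, (x) => x.s);
// → [{ id: 2, s: 4.8 }, { id: 3, s: 3.1 }]
```

This is O(n log k) rather than the O(n log n) of sorting every candidate, which matters when ranking 50,000+ documents for a 10-result page.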

Project Structure

```text
mini-search-engine/
├── src/
│   ├── api/
│   │   ├── middleware/        # requestLogger, rateLimiter, errorHandler
│   │   ├── routes/
│   │   │   ├── search.js      # GET /api/search, /autocomplete, /suggest, /stats
│   │   │   └── documents.js   # POST/GET/PUT/DELETE /api/documents
│   │   └── server.js
│   ├── config/
│   │   ├── constants.js       # Tuning parameters (BM25 k1/b, boosts, cache TTL…)
│   │   └── database.js        # pg connection pool
│   ├── data-structures/
│   │   ├── InvertedIndex.js   # Postings lists, phrase search, serialization
│   │   ├── MinHeap.js         # Generic heap + static topK()
│   │   ├── Trie.js            # Prefix autocomplete, frequency tracking
│   │   └── TrieNode.js
│   ├── indexing/
│   │   ├── DocumentProcessor.js  # Tokenize → normalize → stop-word → stem → n-gram
│   │   ├── Indexer.js            # Index / remove / reindex documents, persistence hooks
│   │   └── TFIDFCalculator.js    # log-TF, IDF, TF-IDF, BM25
│   ├── search/
│   │   ├── QueryParser.js     # Parse boolean, phrase, wildcard, fuzzy, field queries
│   │   ├── Ranker.js          # Score + rank candidates via MinHeap
│   │   └── SearchEngine.js    # Unified entry point, cache, pagination
│   ├── storage/
│   │   ├── DocumentRepository.js
│   │   ├── IndexRepository.js
│   │   └── migrations/
│   │       ├── migrate.js
│   │       └── schema.sql
│   └── utils/
│       ├── helpers.js
│       └── logger.js          # Winston with file + console transports
├── scripts/
│   └── ingestWikipedia.js     # Batch ingestion (1 000 docs/batch) with retry logic
├── tests/
│   ├── unit/                  # InvertedIndex, Trie, MinHeap, DocumentProcessor,
│   │   └── ...                # TFIDFCalculator, Ranker, QueryParser
│   └── integration/           # SearchEngine end-to-end, API routes (DB mocked)
├── .env.example
└── package.json
```
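The `DocumentProcessor` pipeline (tokenize → normalize → stop-word → stem) can be sketched like this. The stop list and naive suffix stripper below are toy stand-ins for the `stopword` package and `natural`'s Porter stemmer that the real module uses:

```javascript
// Toy sketch of the tokenize → normalize → stop-word → stem pipeline.
const STOP_WORDS = new Set(['the', 'is', 'of', 'a', 'and', 'to']);

function processText(text) {
  return text
    .toLowerCase()                              // normalize case
    .split(/[^a-z0-9]+/)                        // tokenize on non-alphanumerics
    .filter((t) => t && !STOP_WORDS.has(t))     // drop empties and stop words
    .map((t) => (t.length > 3 ? t.replace(/(ing|ed|s)$/, '') : t)); // crude stemming stand-in
}

processText('The event loop is the heart of Node.js');
// → ['event', 'loop', 'heart', 'node', 'js']
```

The real pipeline additionally emits n-grams (see `NGRAM_SIZES` in the configuration table below) before handing terms to the `InvertedIndex`.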

Getting Started

Prerequisites

| Requirement | Version |
| --- | --- |
| Node.js | ≥ 18 |
| npm | ≥ 9 |
| PostgreSQL | ≥ 14 |

Installation

```bash
git clone https://github.com/ayomidelog/Mini-Beta.git
cd Mini-Beta
npm install
```

Environment Variables

Copy the example file and fill in your values:

```bash
cp .env.example .env
```
| Variable | Default | Description |
| --- | --- | --- |
| `PORT` | 3000 | HTTP port |
| `NODE_ENV` | development | `development` \| `test` \| `production` |
| `DB_HOST` | localhost | PostgreSQL host |
| `DB_PORT` | 5432 | PostgreSQL port |
| `DB_NAME` | search_engine | Database name |
| `DB_USER` | postgres | Database user |
| `DB_PASSWORD` | password | Database password |
| `DB_POOL_MAX` | 10 | Max connection pool size |
| `MAX_RESULTS` | 10 | Default page size |
| `MIN_SCORE_THRESHOLD` | 0.01 | Minimum relevance score to include |
| `CACHE_TTL` | 300 | Result cache TTL in seconds |
| `AUTOCOMPLETE_LIMIT` | 10 | Max autocomplete suggestions |

Database Setup

```bash
# Create the database (adjust credentials as needed)
createdb search_engine

# Run migrations
npm run migrate
```

Running the Server

```bash
# Production
npm start

# Development (auto-reload)
npm run dev
```

The API will be available at http://localhost:3000.


REST API Reference

Search

```http
GET /api/search?q=<query>&page=1&limit=10
```

Query parameters

| Parameter | Type | Default | Description |
| --- | --- | --- | --- |
| `q` | string | (none) | Search query (required) |
| `page` | integer | 1 | Result page number |
| `limit` | integer | 10 | Results per page (max 100) |

Example request

```bash
curl "http://localhost:3000/api/search?q=%22event+loop%22+AND+performance&page=1&limit=5"
```

Example response

```json
{
  "results": [
    { "docId": "42", "score": 4.87 },
    { "docId": "17", "score": 3.12 }
  ],
  "totalHits": 2,
  "took": 11,
  "page": 1,
  "pageSize": 5
}
```
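For programmatic clients, the same query string can be built with `URLSearchParams`, which produces exactly the encoding shown in the curl example (Node 18+ also ships a global `fetch` for issuing the request):

```javascript
// Build the search URL programmatically; URLSearchParams handles the
// percent-encoding needed for phrase + boolean queries.
const params = new URLSearchParams({
  q: '"event loop" AND performance',
  page: '1',
  limit: '5',
});
const url = `http://localhost:3000/api/search?${params}`;
// → http://localhost:3000/api/search?q=%22event+loop%22+AND+performance&page=1&limit=5

// With the server running (Node 18+):
// const { results, totalHits } = await (await fetch(url)).json();
```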

Autocomplete

```http
GET /api/autocomplete?q=<prefix>&limit=5
```

Returns prefix-matched terms sorted by index frequency.

Example request

```bash
curl "http://localhost:3000/api/autocomplete?q=perf&limit=5"
```

Example response

```json
{
  "suggestions": [
    { "word": "performance", "frequency": 42 },
    { "word": "perform", "frequency": 18 }
  ]
}
```
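This response shape comes from the Trie's frequency-ranked prefix walk. A minimal illustrative sketch (the repo's Trie additionally tracks per-node document-ID sets, omitted here):

```javascript
// Minimal sketch of frequency-ranked prefix autocomplete.
class TinyTrie {
  constructor() {
    this.root = { children: new Map(), freq: 0, word: null };
  }

  insert(word, freq = 1) {
    let node = this.root;
    for (const ch of word) {
      if (!node.children.has(ch)) {
        node.children.set(ch, { children: new Map(), freq: 0, word: null });
      }
      node = node.children.get(ch);
    }
    node.word = word; // mark end of a complete term
    node.freq += freq;
  }

  autocomplete(prefix, limit = 5) {
    let node = this.root;
    for (const ch of prefix) {
      node = node.children.get(ch);
      if (!node) return []; // prefix not in the index
    }
    const out = [];
    const walk = (n) => {
      if (n.word) out.push({ word: n.word, frequency: n.freq });
      for (const child of n.children.values()) walk(child);
    };
    walk(node);
    return out.sort((a, b) => b.frequency - a.frequency).slice(0, limit);
  }
}

const trie = new TinyTrie();
trie.insert('performance', 42);
trie.insert('perform', 18);
trie.insert('persist', 7);
trie.autocomplete('perf', 5);
// → [{ word: 'performance', frequency: 42 }, { word: 'perform', frequency: 18 }]
```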

Suggestions

```http
GET /api/suggest?q=<query>
```

Returns alternative query terms derived from indexed vocabulary (useful for "did you mean?" features).

Example request

```bash
curl "http://localhost:3000/api/suggest?q=javascrpt"
```

Example response

```json
{ "suggestions": ["javascript", "javascripts"] }
```

Documents

Index a document

```http
POST /api/documents
Content-Type: application/json

{
  "id": "doc-1",
  "title": "Node.js internals",
  "heading": "Event Loop",
  "body": "The event loop is the heart of Node.js...",
  "url": "https://example.com/nodejs-internals"
}
```

Response 201 Created

```json
{ "docId": "doc-1", "termCount": 312, "success": true }
```

Bulk index documents

```http
POST /api/documents/batch
Content-Type: application/json

{
  "documents": [
    { "id": "1", "title": "...", "body": "..." },
    { "id": "2", "title": "...", "body": "..." }
  ]
}
```

Retrieve a document

```http
GET /api/documents/:id
```

Update / reindex a document

```http
PUT /api/documents/:id
Content-Type: application/json

{ "title": "Updated title", "body": "Updated body text..." }
```

Delete a document

```http
DELETE /api/documents/:id
```

Stats & Health

```text
GET /health           → { "status": "ok", "uptime": 123.4 }
GET /api/search/stats → { "stats": { "termCount": 182340, "documentCount": 50000, "trieSize": 75000, "cacheSize": 42 } }
```

Data Ingestion

The included ingestion script reads a newline-delimited JSON file (each line is an object with `id`, `title`, `body`, and `url` fields) and indexes documents in batches of 1,000, with exponential-backoff retries on failure.

```bash
node scripts/ingestWikipedia.js --file /path/to/articles.ndjson
```

Or point it at a remote URL:

```bash
node scripts/ingestWikipedia.js --url https://example.com/articles.ndjson
```
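The batch-plus-retry pattern the script uses can be sketched as follows; the function names (`withRetry`, `batches`, `postBatch`) are illustrative, not the script's actual exports:

```javascript
// Retry an async operation with exponential backoff: 500 ms, 1 s, 2 s, ...
async function withRetry(fn, { attempts = 3, baseDelayMs = 500 } = {}) {
  for (let i = 0; i < attempts; i++) {
    try {
      return await fn();
    } catch (err) {
      if (i === attempts - 1) throw err; // out of retries
      await new Promise((r) => setTimeout(r, baseDelayMs * 2 ** i));
    }
  }
}

// Split parsed NDJSON documents into fixed-size batches.
function* batches(items, size = 1000) {
  for (let i = 0; i < items.length; i += size) yield items.slice(i, i + size);
}

// Usage sketch: POST each batch to /api/documents/batch with retry.
// for (const batch of batches(docs)) {
//   await withRetry(() => postBatch(batch));
// }
```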

Query Syntax

| Syntax | Example | Behaviour |
| --- | --- | --- |
| Simple terms | `node performance` | Scores documents containing any term |
| Boolean AND | `node AND performance` | Documents must contain both terms |
| Boolean OR | `node OR deno` | Documents containing either term |
| Boolean NOT | `javascript NOT typescript` | Excludes documents containing the NOT term |
| Phrase | `"event loop"` | Documents where terms appear consecutively |
| Mixed | `"event loop" AND performance` | Phrase constraint + boolean filter |
| Wildcard | `perform*` | Prefix match on `perform` |
| Fuzzy | `performanc~` | Returns closest prefix matches via Trie |
| Field-scoped | `title:nodejs` | Searches only the specified field |
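A toy left-to-right evaluator for the boolean subset of this syntax (no precedence or parentheses; the real `QueryParser` also handles phrase, wildcard, fuzzy, and field queries):

```javascript
// Evaluate "term OP term OP term ..." left to right over a toy corpus.
// docs: Map(docId -> Set of terms in that document).
function evaluate(query, docs) {
  const tokens = query.split(/\s+/);
  const matching = (term) =>
    new Set(
      [...docs].filter(([, terms]) => terms.has(term.toLowerCase())).map(([id]) => id)
    );

  let result = matching(tokens[0]);
  for (let i = 1; i < tokens.length - 1; i += 2) {
    const op = tokens[i];
    const next = matching(tokens[i + 1]);
    if (op === 'AND') result = new Set([...result].filter((id) => next.has(id)));
    else if (op === 'OR') result = new Set([...result, ...next]);
    else if (op === 'NOT') result = new Set([...result].filter((id) => !next.has(id)));
  }
  return [...result];
}

const docs = new Map([
  ['1', new Set(['node', 'performance'])],
  ['2', new Set(['node', 'deno'])],
  ['3', new Set(['typescript'])],
]);
evaluate('node AND performance', docs); // → ['1']
evaluate('node NOT deno', docs);        // → ['1']
```

In the real engine the per-term document sets come from the inverted index's postings lists, so each `matching()` step is a lookup rather than a scan.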

Configuration

All tuning parameters live in src/config/constants.js:

| Constant | Default | Description |
| --- | --- | --- |
| `BM25_K1` | 1.2 | BM25 term-frequency saturation parameter |
| `BM25_B` | 0.75 | BM25 document-length normalization |
| `TITLE_BOOST` | 3.0 | Score multiplier for title-field matches |
| `HEADING_BOOST` | 2.0 | Score multiplier for heading-field matches |
| `BODY_BOOST` | 1.0 | Score multiplier for body-field matches |
| `CACHE_TTL` | 300 | Time-to-live for cached search results, in seconds |
| `CACHE_MAX_SIZE` | 1000 | Maximum number of cached queries |
| `NGRAM_SIZES` | [2] | N-gram sizes generated during indexing |
| `MAX_POSITIONS_STORED` | 100 | Max token positions stored per term per doc |
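The `BM25_K1` and `BM25_B` parameters plug into the per-term score roughly as follows. This sketch uses the common Lucene-style IDF; the repo's exact variant may differ:

```javascript
// BM25 per-term score using the K1/B defaults from the table above.
const BM25_K1 = 1.2;  // term-frequency saturation
const BM25_B = 0.75;  // document-length normalization

function bm25Term(tf, docLen, avgDocLen, docCount, docFreq) {
  // Rarer terms get a higher IDF (Lucene-style variant, assumed here).
  const idf = Math.log(1 + (docCount - docFreq + 0.5) / (docFreq + 0.5));
  // Long documents are penalized in proportion to b.
  const norm = 1 - BM25_B + BM25_B * (docLen / avgDocLen);
  return (idf * (tf * (BM25_K1 + 1))) / (tf + BM25_K1 * norm);
}

// Higher term frequency raises the score with diminishing returns (k1):
bm25Term(3, 100, 120, 50000, 400) > bm25Term(1, 100, 120, 50000, 400); // true
```

Field boosts are then applied on top: a title match is worth `TITLE_BOOST` times a body match of the same term.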

Testing

```bash
# Run all tests (158 tests, 9 suites)
npm test

# Unit tests only
npm run test:unit

# Integration tests only
npm run test:integration
```

Tests mock all PostgreSQL dependencies, so no live database is required.


Tech Stack

| Layer | Technology |
| --- | --- |
| Runtime | Node.js 18+ |
| Web framework | Express 4 |
| NLP / stemming | `natural` (Porter stemmer), `stopword` |
| Database | PostgreSQL 14+ with `pg` pool |
| Logging | Winston (console + rotating file transports) |
| Security | Helmet, `express-rate-limit` |
| Compression | `compression` |
| Testing | Jest 29, Supertest |
| Dev tooling | Nodemon, dotenv |
