Forest Scoring & Linking Specification v2

Overview

Forest uses a dual-score edge model where semantic (embedding) and tag (IDF-weighted) scores are computed independently but stored together on a single edge per node pair. This replaces the previous single collapsed hybrid score.

Implementation Status (in this repo)

Implemented:

Dual scores on edges (semantic_score, tag_score, shared_tags) + v2 migration command (forest admin migrate-v2)
Normalized node_tags + cached tag_idf for IDF-weighted tag scoring
Bridge tags via forest link (#link/...) and tag-aware explore filtering

Deferred (see SCORING_V2_PLAN.md Phase 4):

HNSW index (.idx) + fast approximate linking
Embedding storage migration (JSON → float32 BLOB)

Design Principles

Dual scores, single edge - One edge row per pair with semantic_score and tag_score fields
Clean separation - Semantic (AI) layer vs Tag (human/agent) layer scored independently
Tiered performance - Fast approximate on capture, full rescore on demand
Bridge tags - Unique tags (#link/xyz) for explicit human-controlled linking

Edge Model

Single Edge, Dual Scores

Each edge stores both layer scores:

interface Edge {
  id: string;
  sourceId: string;
  targetId: string;
  semanticScore: number | null;  // null if no embedding
  tagScore: number | null;       // null if no shared tags
  sharedTags: string[];          // for tag score explainability
  createdAt: string;
  updatedAt: string;
}

Rationale: Avoids graphology multi: true complexity. Simpler queries. One edge = one relationship with multiple facets.

Edge Creation Rules

An edge is created if either score exceeds its threshold:

semantic_score >= FOREST_SEMANTIC_THRESHOLD (default: 0.5), OR
tag_score >= FOREST_TAG_THRESHOLD (default: 0.3)

Edges with both scores below threshold are deleted.

Scoring Algorithms

1. Semantic Score

Source: Embedding cosine similarity

semantic_score = cosine(embedding_a, embedding_b)

Range: 0.0 to 1.0

Index: HNSW in separate .idx file (recommend hnswlib)

2. Tag Score

Source: IDF-weighted Jaccard similarity

jaccard = |tags_a ∩ tags_b| / |tags_a ∪ tags_b|

For each shared tag t:
  idf(t) = log(N / doc_freq(t))  where N = total nodes

avg_idf = mean(idf(t) for t in shared_tags)
max_idf = log(N / 1)  # theoretical max: tag on only 1 node

tag_score = jaccard × (avg_idf / max_idf)

Range: 0.0 to 1.0 (bounded by Jaccard × normalized IDF)

Interpretation:

High Jaccard + rare shared tags → high score
High Jaccard + common tags → medium score
Low Jaccard → low score regardless of IDF

Index: node_tags join table (see schema)

Bridge Tags

Bridge tags are special tags prefixed with link/ that create strong bonds between specific nodes.

Syntax:

# Auto-generated bridge
forest link node-A node-B
# Creates #link/7fa3b2c1 on both nodes

# User-named bridge
forest link node-A node-B --name=chapter-1-arc
# Creates #link/chapter-1-arc on both nodes

Parser Change Required: Expand tag regex from #[a-zA-Z0-9_-]+ to #[a-zA-Z0-9_/-]+ to allow /.

Behavior:

Bridge tags have doc_freq = number of nodes in the bridge (typically 2-5)
IDF is very high → boosts tag_score significantly
Visible in tag listings with link/ prefix
Remove with: forest tags remove node-A #link/chapter-1-arc

Tiered Linking Strategy

On Capture (Fast Path)

Compute embedding for new node
Query HNSW index for top-k approximate nearest neighbors (k=100)
Compute semantic_score against candidates only
Query node_tags table for nodes sharing any tag with new node
Compute tag_score against tag-sharing nodes
Union candidate sets, create/update edges exceeding thresholds
Mark node as approximate_scored = true

Complexity: O(log n) for HNSW + O(t×m) for tag lookup (t=tags, m=avg nodes per tag)

Full Rescore (Accurate Path)

forest admin rescore [--semantic] [--tags] [--all]

Rebuild tag_idf cache from node_tags
For each node pair, recompute requested scores
Update edges, delete those below both thresholds
Set approximate_scored = false on affected nodes
Rebuild HNSW index if --semantic

Complexity: O(n²) - run manually when accuracy matters

Database Schema (v2)

-- Nodes table (modified)
CREATE TABLE nodes (
  id TEXT PRIMARY KEY,
  title TEXT,
  body TEXT,
  tags TEXT,  -- JSON array (kept for compatibility, denormalized)
  token_counts TEXT,  -- JSON object
  embedding BLOB,  -- float32[] little-endian (MIGRATION: currently JSON text)
  created_at TEXT,
  updated_at TEXT,
  approximate_scored INTEGER DEFAULT 1
);

-- NEW: Normalized tag storage for efficient lookups
CREATE TABLE node_tags (
  node_id TEXT NOT NULL,
  tag TEXT NOT NULL,
  PRIMARY KEY (node_id, tag),
  FOREIGN KEY (node_id) REFERENCES nodes(id) ON DELETE CASCADE
);
CREATE INDEX idx_node_tags_tag ON node_tags(tag);

-- Tag IDF cache (rebuilt on rescore)
CREATE TABLE tag_idf (
  tag TEXT PRIMARY KEY,
  doc_freq INTEGER NOT NULL,
  idf REAL NOT NULL
);

-- Edges table (modified from v1)
CREATE TABLE edges (
  id TEXT PRIMARY KEY,
  source_id TEXT NOT NULL,
  target_id TEXT NOT NULL,
  semantic_score REAL,  -- NULL if no embeddings available
  tag_score REAL,       -- NULL if no shared tags
  shared_tags TEXT,     -- JSON array of shared tag names
  status TEXT DEFAULT 'accepted',
  created_at TEXT,
  updated_at TEXT,
  UNIQUE(source_id, target_id)
);
CREATE INDEX idx_edges_source ON edges(source_id);
CREATE INDEX idx_edges_target ON edges(target_id);

-- Keep edge_events for history (unchanged)
CREATE TABLE edge_events (
  id TEXT PRIMARY KEY,
  edge_id TEXT,
  source_id TEXT,
  target_id TEXT,
  prev_status TEXT,
  next_status TEXT,
  payload TEXT,
  created_at TEXT,
  undone INTEGER DEFAULT 0
);

Migration Notes

Embeddings: Currently stored as JSON text arrays. Migration to BLOB (float32 little-endian) reduces storage ~4x and enables direct memory mapping.
node_tags table: Must be populated from existing nodes.tags JSON. Keep JSON column for backward compatibility but node_tags is source of truth for scoring.
edges table: Add semantic_score, tag_score, shared_tags columns. Populate semantic_score from existing score column initially.

Index Storage

Data	Location	Format
Nodes, edges, tags	`forest.db`	SQLite
Embeddings	`forest.db` (blob column)	float32 little-endian
HNSW index	`forest.idx`	hnswlib binary format

Display & Filtering

Color Coding

When displaying edges (requires both scores non-null for comparison):

Blue: Semantic-dominant (semantic_score > tag_score × 1.2)
Green: Tag-dominant (tag_score > semantic_score × 1.2)
Purple: Balanced (within 20% of each other)
Gray: Single-layer only (one score is null)

Score Display

→ related-node [S:0.72 T:0.45]  # Purple, balanced
→ another-node [S:0.81 T:--]    # Gray, semantic only
→ tagged-node  [S:-- T:0.67]    # Gray, tags only

Filtering

forest explore node-id                    # All edges
forest explore node-id --by=semantic      # Only edges with semantic_score
forest explore node-id --by=tags          # Only edges with tag_score
forest explore node-id --min-semantic=0.6 # Semantic threshold filter
forest explore node-id --min-tags=0.4     # Tag threshold filter

Configuration

Variable	Default	Description
`FOREST_SEMANTIC_THRESHOLD`	0.5	Min semantic_score to contribute to edge
`FOREST_TAG_THRESHOLD`	0.3	Min tag_score to contribute to edge
`FOREST_HNSW_M`	16	HNSW connectivity parameter
`FOREST_HNSW_EF`	200	HNSW search depth
`FOREST_ANN_CANDIDATES`	100	Top-k candidates from approximate search

Migration Path

From v1 (single `score` column)

forest admin migrate-v2

Add new columns to edges table
Create node_tags table, populate from nodes.tags JSON
Create tag_idf table, compute initial values
Copy existing score → semantic_score
Compute tag_score for all edges
Optionally convert embeddings JSON → BLOB (can be deferred)

Rollback

Keep score column populated (write to both during transition period).

Implementation Checklist

Future Considerations

Temporal layer: Session proximity, calendar clustering (third score dimension)
Explicit edge types: User-defined relationship types beyond similarity
Daemon mode: Background rescoring when system idle
Distributed index: For multi-device sync scenarios

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Forest Scoring & Linking Specification v2

Overview

Implementation Status (in this repo)

Design Principles

Edge Model

Single Edge, Dual Scores

Edge Creation Rules

Scoring Algorithms

1. Semantic Score

2. Tag Score

Bridge Tags

Tiered Linking Strategy

On Capture (Fast Path)

Full Rescore (Accurate Path)

Database Schema (v2)

Migration Notes

Index Storage

Display & Filtering

Color Coding

Score Display

Filtering

Configuration

Migration Path

From v1 (single `score` column)

Rollback

Implementation Checklist

Future Considerations

FilesExpand file tree

SCORING_SPEC.md

Latest commit

History

SCORING_SPEC.md

File metadata and controls

Forest Scoring & Linking Specification v2

Overview

Implementation Status (in this repo)

Design Principles

Edge Model

Single Edge, Dual Scores

Edge Creation Rules

Scoring Algorithms

1. Semantic Score

2. Tag Score

Bridge Tags

Tiered Linking Strategy

On Capture (Fast Path)

Full Rescore (Accurate Path)

Database Schema (v2)

Migration Notes

Index Storage

Display & Filtering

Color Coding

Score Display

Filtering

Configuration

Migration Path

From v1 (single score column)

Rollback

Implementation Checklist

Future Considerations

From v1 (single `score` column)