Forest uses a dual-score edge model where semantic (embedding) and tag (IDF-weighted) scores are computed independently but stored together on a single edge per node pair. This replaces the previous single collapsed hybrid score.
Implemented:

- Dual scores on `edges` (`semantic_score`, `tag_score`, `shared_tags`) + v2 migration command (`forest admin migrate-v2`)
- Normalized `node_tags` + cached `tag_idf` for IDF-weighted tag scoring
- Bridge tags via `forest link` (`#link/...`) and tag-aware explore filtering

Deferred (see SCORING_V2_PLAN.md Phase 4):

- HNSW index (`.idx`) + fast approximate linking
- Embedding storage migration (JSON → float32 BLOB)
- Dual scores, single edge: One edge row per pair with `semantic_score` and `tag_score` fields
- Clean separation: Semantic (AI) layer vs Tag (human/agent) layer scored independently
- Tiered performance: Fast approximate on capture, full rescore on demand
- Bridge tags: Unique tags (`#link/xyz`) for explicit human-controlled linking
Each edge stores both layer scores:

    interface Edge {
      id: string;
      sourceId: string;
      targetId: string;
      semanticScore: number | null;  // null if no embedding
      tagScore: number | null;       // null if no shared tags
      sharedTags: string[];          // for tag score explainability
      createdAt: string;
      updatedAt: string;
    }

Rationale: Avoids the complexity of graphology's `multi: true` mode and keeps queries simple. One edge = one relationship with multiple facets.
An edge is created if either score meets or exceeds its threshold:

- `semantic_score >= FOREST_SEMANTIC_THRESHOLD` (default: 0.5), OR
- `tag_score >= FOREST_TAG_THRESHOLD` (default: 0.3)

Edges with both scores below threshold are deleted.
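A minimal sketch of the dual-threshold rule, for illustration only. The names `Thresholds` and `shouldKeepEdge` are hypothetical, and treating a null score as "below threshold" is an assumption rather than documented behavior.

```typescript
// Hypothetical helper; the default values mirror the documented thresholds.
interface Thresholds {
  semantic: number; // FOREST_SEMANTIC_THRESHOLD
  tag: number;      // FOREST_TAG_THRESHOLD
}

const DEFAULT_THRESHOLDS: Thresholds = { semantic: 0.5, tag: 0.3 };

// Returns true when at least one layer meets its threshold.
// Assumption: a null score (layer unavailable) never qualifies on its own.
function shouldKeepEdge(
  semanticScore: number | null,
  tagScore: number | null,
  t: Thresholds = DEFAULT_THRESHOLDS,
): boolean {
  const semanticOk = semanticScore !== null && semanticScore >= t.semantic;
  const tagOk = tagScore !== null && tagScore >= t.tag;
  return semanticOk || tagOk;
}
```

Under these assumptions, a pair with `semanticScore = 0.49` and `tagScore = 0.29` is dropped, while a pair with only `tagScore = 0.3` is kept.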
Source: Embedding cosine similarity

    semantic_score = cosine(embedding_a, embedding_b)

Range: 0.0 to 1.0 (cosine similarity can in principle be negative; such values fall below any positive threshold)

Index: HNSW in separate `.idx` file (recommend hnswlib)
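For reference, cosine similarity over two embedding vectors can be sketched as follows. The function name `cosineSimilarity` is hypothetical, and returning 0 for a zero-length vector is an assumption made to avoid dividing by zero.

```typescript
// Cosine similarity: dot(a, b) / (|a| * |b|).
function cosineSimilarity(a: readonly number[], b: readonly number[]): number {
  if (a.length !== b.length) {
    throw new Error("embedding dimensions differ");
  }
  let dot = 0;
  let normA = 0;
  let normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  // Assumption: a zero vector has no direction, so report no similarity.
  if (normA === 0 || normB === 0) return 0;
  return dot / Math.sqrt(normA * normB);
}
```

Identical directions give 1, orthogonal vectors give 0, and opposite directions give -1.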
Source: IDF-weighted Jaccard similarity

    jaccard = |tags_a ∩ tags_b| / |tags_a ∪ tags_b|

    For each shared tag t:
      idf(t) = log(N / doc_freq(t))    where N = total nodes

    avg_idf = mean(idf(t) for t in shared_tags)
    max_idf = log(N / 1)               # theoretical max: tag on only 1 node

    tag_score = jaccard × (avg_idf / max_idf)

Range: 0.0 to 1.0 (bounded by Jaccard × normalized IDF)

Interpretation:

- High Jaccard + rare shared tags → high score
- High Jaccard + common tags → medium score
- Low Jaccard → low score regardless of IDF

Index: `node_tags` join table (see schema)
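The tag score formula can be sketched as below. This is an illustration, not the project's `computeScore()`: the name `computeTagScore`, the plain-object `docFreq` lookup, and the guard for `N <= 1` (where `max_idf` would be 0) are all assumptions. Tag lists are assumed to contain no duplicates, which the `node_tags` primary key would guarantee.

```typescript
interface TagScoreResult {
  score: number | null; // null when no tags are shared
  sharedTags: string[];
}

function computeTagScore(
  tagsA: readonly string[],
  tagsB: readonly string[],
  docFreq: Record<string, number>, // tag -> number of nodes carrying it
  totalNodes: number,              // N
): TagScoreResult {
  const sharedTags = tagsA.filter((t) => tagsB.indexOf(t) !== -1);
  if (sharedTags.length === 0) return { score: null, sharedTags };

  // |A ∪ B| = |A| + |B| - |A ∩ B| for duplicate-free lists.
  const unionSize = tagsA.length + tagsB.length - sharedTags.length;
  const jaccard = sharedTags.length / unionSize;

  const maxIdf = Math.log(totalNodes); // log(N / 1)
  // Assumption: with N <= 1 the normalization is undefined, so score 0.
  if (maxIdf <= 0) return { score: 0, sharedTags };

  const idfSum = sharedTags.reduce(
    (sum, t) => sum + Math.log(totalNodes / (docFreq[t] ?? 1)),
    0,
  );
  const avgIdf = idfSum / sharedTags.length;
  return { score: jaccard * (avgIdf / maxIdf), sharedTags };
}
```

Worked example: with N = 100, nodes tagged `[rust, cli]` and `[rust, db]`, and `rust` on 10 nodes, Jaccard is 1/3 and the normalized IDF is ln(10)/ln(100) = 0.5, so the score is about 0.167, below the default tag threshold of 0.3.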
Bridge tags are special tags prefixed with link/ that create strong bonds between specific nodes.
Syntax:

    # Auto-generated bridge
    forest link node-A node-B
    # Creates #link/7fa3b2c1 on both nodes

    # User-named bridge
    forest link node-A node-B --name=chapter-1-arc
    # Creates #link/chapter-1-arc on both nodes

Parser Change Required: Expand tag regex from `#[a-zA-Z0-9_-]+` to `#[a-zA-Z0-9_/-]+` to allow `/`.
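The effect of the regex change can be checked with a small sketch. `extractTags` is a hypothetical helper, not the parser in `src/lib/text.ts`; the `/` inside the character class is escaped only so it is unambiguous inside a regex literal.

```typescript
const OLD_TAG_RE = /#[a-zA-Z0-9_-]+/g;
const NEW_TAG_RE = /#[a-zA-Z0-9_\/-]+/g; // same class plus "/"

// Returns tag names without the leading "#".
function extractTags(text: string, re: RegExp): string[] {
  const matches = text.match(re);
  return matches === null ? [] : matches.map((m) => m.slice(1));
}

const sample = "Outline for #link/chapter-1-arc, see also #rust";
const before = extractTags(sample, OLD_TAG_RE); // bridge tag cut at "/"
const after = extractTags(sample, NEW_TAG_RE);  // bridge tag kept whole
```

With the old pattern the bridge tag is truncated to `link`; with the new one it is captured as `link/chapter-1-arc`.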
Behavior:

- Bridge tags have `doc_freq` = number of nodes in the bridge (typically 2-5)
- IDF is very high → boosts `tag_score` significantly
- Visible in tag listings with `link/` prefix
- Remove with: `forest tags remove node-A #link/chapter-1-arc`
On capture (fast approximate path):

- Compute embedding for new node
- Query HNSW index for top-k approximate nearest neighbors (k=100)
- Compute `semantic_score` against candidates only
- Query `node_tags` table for nodes sharing any tag with new node
- Compute `tag_score` against tag-sharing nodes
- Union candidate sets, create/update edges that meet either threshold
- Mark node as `approximate_scored = true`

Complexity: O(log n) for HNSW + O(t×m) for tag lookup (t=tags, m=avg nodes per tag)
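The "union candidate sets" step might look like the sketch below. `unionCandidates` is a hypothetical name, and both input lists are assumed to hold node ids already returned by the ANN query and the `node_tags` lookup.

```typescript
// Merge ANN candidates with tag-sharing candidates, dropping duplicates
// and the new node itself. Order of first appearance is preserved.
function unionCandidates(
  annIds: readonly string[], // from the approximate nearest-neighbor query
  tagIds: readonly string[], // from the node_tags lookup
  selfId: string,
): string[] {
  const merged: string[] = [];
  for (const id of annIds.concat(tagIds)) {
    // Linear scan is acceptable here: candidate lists are small (k ≈ 100).
    if (id !== selfId && merged.indexOf(id) === -1) {
      merged.push(id);
    }
  }
  return merged;
}
```

Each id in the result would then be scored on both layers, so a node found only through tags still gets a `semantic_score` when embeddings are available.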
    forest admin rescore [--semantic] [--tags] [--all]

- Rebuild `tag_idf` cache from `node_tags`
- For each node pair, recompute requested scores
- Update edges, delete those below both thresholds
- Set `approximate_scored = false` on affected nodes
- Rebuild HNSW index if `--semantic`

Complexity: O(n²); run manually when accuracy matters
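Rebuilding the IDF cache is a single pass over the tag rows. The sketch below works on in-memory rows rather than SQLite; `NodeTagRow`, `TagIdfRow`, and `buildTagIdf` are hypothetical names, and the natural logarithm is an assumption (the base cancels out in the normalized score).

```typescript
interface NodeTagRow { nodeId: string; tag: string; }
interface TagIdfRow { tag: string; docFreq: number; idf: number; }

// Assumes each (nodeId, tag) pair appears once, as the primary key enforces.
function buildTagIdf(rows: readonly NodeTagRow[], totalNodes: number): TagIdfRow[] {
  // Prototype-free object so tags like "constructor" count correctly.
  const counts: Record<string, number> = Object.create(null);
  for (const row of rows) {
    counts[row.tag] = (counts[row.tag] ?? 0) + 1;
  }
  return Object.keys(counts)
    .sort()
    .map((tag) => ({
      tag,
      docFreq: counts[tag],
      idf: Math.log(totalNodes / counts[tag]), // idf(t) = log(N / doc_freq(t))
    }));
}
```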
    -- Nodes table (modified)
    CREATE TABLE nodes (
        id TEXT PRIMARY KEY,
        title TEXT,
        body TEXT,
        tags TEXT,           -- JSON array (kept for compatibility, denormalized)
        token_counts TEXT,   -- JSON object
        embedding BLOB,      -- float32[] little-endian (MIGRATION: currently JSON text)
        created_at TEXT,
        updated_at TEXT,
        approximate_scored INTEGER DEFAULT 1
    );

    -- NEW: Normalized tag storage for efficient lookups
    CREATE TABLE node_tags (
        node_id TEXT NOT NULL,
        tag TEXT NOT NULL,
        PRIMARY KEY (node_id, tag),
        FOREIGN KEY (node_id) REFERENCES nodes(id) ON DELETE CASCADE
    );
    CREATE INDEX idx_node_tags_tag ON node_tags(tag);

    -- Tag IDF cache (rebuilt on rescore)
    CREATE TABLE tag_idf (
        tag TEXT PRIMARY KEY,
        doc_freq INTEGER NOT NULL,
        idf REAL NOT NULL
    );

    -- Edges table (modified from v1)
    CREATE TABLE edges (
        id TEXT PRIMARY KEY,
        source_id TEXT NOT NULL,
        target_id TEXT NOT NULL,
        semantic_score REAL,   -- NULL if no embeddings available
        tag_score REAL,        -- NULL if no shared tags
        shared_tags TEXT,      -- JSON array of shared tag names
        status TEXT DEFAULT 'accepted',
        created_at TEXT,
        updated_at TEXT,
        UNIQUE(source_id, target_id)
    );
    CREATE INDEX idx_edges_source ON edges(source_id);
    CREATE INDEX idx_edges_target ON edges(target_id);

    -- Keep edge_events for history (unchanged)
    CREATE TABLE edge_events (
        id TEXT PRIMARY KEY,
        edge_id TEXT,
        source_id TEXT,
        target_id TEXT,
        prev_status TEXT,
        next_status TEXT,
        payload TEXT,
        created_at TEXT,
        undone INTEGER DEFAULT 0
    );

- Embeddings: Currently stored as JSON text arrays. Migration to BLOB (float32 little-endian) reduces storage ~4x and enables direct memory mapping.
- `node_tags` table: Must be populated from existing `nodes.tags` JSON. Keep the JSON column for backward compatibility, but `node_tags` is the source of truth for scoring.
- `edges` table: Add `semantic_score`, `tag_score`, `shared_tags` columns. Populate `semantic_score` from the existing `score` column initially.
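The JSON → BLOB conversion for embeddings could be sketched as follows. The function names are hypothetical, and the code uses only standard `DataView` calls so it does not depend on any particular SQLite driver.

```typescript
// Encode a JSON text array such as "[0.1, 0.2]" as float32 little-endian bytes.
function embeddingJsonToBlob(json: string): Uint8Array {
  const values: unknown = JSON.parse(json);
  if (!Array.isArray(values) || !values.every((v) => typeof v === "number")) {
    throw new Error("embedding must be a JSON array of numbers");
  }
  const bytes = new Uint8Array(values.length * 4);
  const view = new DataView(bytes.buffer);
  values.forEach((v, i) => view.setFloat32(i * 4, v, true)); // true = little-endian
  return bytes;
}

// Decode float32 little-endian bytes back into numbers.
function blobToEmbedding(blob: Uint8Array): number[] {
  const view = new DataView(blob.buffer, blob.byteOffset, blob.byteLength);
  const out: number[] = [];
  for (let offset = 0; offset + 4 <= blob.byteLength; offset += 4) {
    out.push(view.getFloat32(offset, true));
  }
  return out;
}
```

Each value takes exactly 4 bytes after conversion. Values are rounded to float32 precision, so a round trip is exact only for numbers representable in float32.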
| Data | Location | Format |
|---|---|---|
| Nodes, edges, tags | `forest.db` | SQLite |
| Embeddings | `forest.db` (blob column) | float32 little-endian |
| HNSW index | `forest.idx` | hnswlib binary format |
When displaying edges (requires both scores non-null for comparison):

- Blue: Semantic-dominant (`semantic_score > tag_score × 1.2`)
- Green: Tag-dominant (`tag_score > semantic_score × 1.2`)
- Purple: Balanced (within 20% of each other)
- Gray: Single-layer only (one score is null)

Example output:

    → related-node [S:0.72 T:0.45] # Blue, semantic-dominant (0.72 > 0.45 × 1.2)
    → another-node [S:0.81 T:--] # Gray, semantic only
    → tagged-node [S:-- T:0.67] # Gray, tags only
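The color rules translate directly into a small classifier. `edgeColor` and `formatEdge` are hypothetical names, and the output format simply mirrors the display lines above.

```typescript
type EdgeColor = "blue" | "green" | "purple" | "gray";

function edgeColor(semantic: number | null, tag: number | null): EdgeColor {
  if (semantic === null || tag === null) return "gray"; // single-layer only
  if (semantic > tag * 1.2) return "blue";              // semantic-dominant
  if (tag > semantic * 1.2) return "green";             // tag-dominant
  return "purple";                                      // within 20% of each other
}

function formatEdge(target: string, semantic: number | null, tag: number | null): string {
  const fmt = (x: number | null): string => (x === null ? "--" : x.toFixed(2));
  return `→ ${target} [S:${fmt(semantic)} T:${fmt(tag)}]`;
}
```

For instance, scores of 0.60 and 0.55 are within 20% of each other and classify as purple, while 0.30 against 0.67 is green.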
    forest explore node-id                      # All edges
    forest explore node-id --by=semantic        # Only edges with semantic_score
    forest explore node-id --by=tags            # Only edges with tag_score
    forest explore node-id --min-semantic=0.6   # Semantic threshold filter
    forest explore node-id --min-tags=0.4       # Tag threshold filter

| Variable | Default | Description |
|---|---|---|
| `FOREST_SEMANTIC_THRESHOLD` | 0.5 | Min `semantic_score` to contribute to edge |
| `FOREST_TAG_THRESHOLD` | 0.3 | Min `tag_score` to contribute to edge |
| `FOREST_HNSW_M` | 16 | HNSW connectivity parameter |
| `FOREST_HNSW_EF` | 200 | HNSW search depth |
| `FOREST_ANN_CANDIDATES` | 100 | Top-k candidates from approximate search |
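One way to read these variables with their defaults is sketched below. `ScoringConfig` and `loadScoringConfig` are hypothetical names, and falling back to the default on a missing or non-numeric value is an assumption. The environment is passed in as a plain object so the sketch is independent of the runtime; in Node.js the caller would pass `process.env`.

```typescript
interface ScoringConfig {
  semanticThreshold: number;
  tagThreshold: number;
  hnswM: number;
  hnswEf: number;
  annCandidates: number;
}

type Env = Record<string, string | undefined>;

// Parse a numeric variable, using the fallback if it is unset or invalid.
function readNumber(env: Env, name: string, fallback: number): number {
  const raw = env[name];
  if (raw === undefined || raw.trim() === "") return fallback;
  const value = Number(raw);
  return isFinite(value) ? value : fallback;
}

function loadScoringConfig(env: Env): ScoringConfig {
  return {
    semanticThreshold: readNumber(env, "FOREST_SEMANTIC_THRESHOLD", 0.5),
    tagThreshold: readNumber(env, "FOREST_TAG_THRESHOLD", 0.3),
    hnswM: readNumber(env, "FOREST_HNSW_M", 16),
    hnswEf: readNumber(env, "FOREST_HNSW_EF", 200),
    annCandidates: readNumber(env, "FOREST_ANN_CANDIDATES", 100),
  };
}
```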
    forest admin migrate-v2

- Add new columns to `edges` table
- Create `node_tags` table, populate from `nodes.tags` JSON
- Create `tag_idf` table, compute initial values
- Copy existing `score` → `semantic_score`
- Compute `tag_score` for all edges
- Optionally convert embeddings JSON → BLOB (can be deferred)

Keep `score` column populated (write to both during transition period).
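Populating `node_tags` from the legacy JSON column amounts to flattening each node's tag array into (node, tag) pairs. The sketch below is illustrative: `LegacyNodeRow`, `NodeTagPair`, and `toNodeTagPairs` are hypothetical names, and skipping rows whose `tags` value is null or not a valid JSON array is an assumption about how the migration should treat bad data.

```typescript
interface LegacyNodeRow { id: string; tags: string | null; } // tags: JSON array text
interface NodeTagPair { nodeId: string; tag: string; }

function toNodeTagPairs(nodes: readonly LegacyNodeRow[]): NodeTagPair[] {
  const pairs: NodeTagPair[] = [];
  for (const node of nodes) {
    let parsed: unknown;
    try {
      parsed = JSON.parse(node.tags ?? "[]");
    } catch {
      continue; // assumption: malformed JSON is skipped, not fatal
    }
    if (!Array.isArray(parsed)) continue;
    const seen: string[] = [];
    for (const tag of parsed) {
      // De-duplicate per node so the (node_id, tag) primary key is not violated.
      if (typeof tag === "string" && seen.indexOf(tag) === -1) {
        seen.push(tag);
        pairs.push({ nodeId: node.id, tag });
      }
    }
  }
  return pairs;
}
```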
- Expand tag regex to allow `/` in `src/lib/text.ts:155`
- Add `node_tags` table and sync logic in `src/lib/db.ts`
- Add `tag_idf` table and IDF computation
- Modify `edges` schema: add `semantic_score`, `tag_score`, `shared_tags`
- Update `computeScore()` to return both scores separately
- Update `linkAgainstExisting()` to use dual-threshold logic
- Add `forest link` command for bridge tag creation
- Add `--by`, `--min-semantic`, `--min-tags` flags to explore
- Color coding in edge display
- HNSW index integration (can be phase 2)
- Embedding BLOB migration (can be phase 2)
- Temporal layer: Session proximity, calendar clustering (third score dimension)
- Explicit edge types: User-defined relationship types beyond similarity
- Daemon mode: Background rescoring when system idle
- Distributed index: For multi-device sync scenarios