Retrieval Algorithms: BM25 & HDC/VSA
Status: IMPLEMENTED

MRP-VM uses two complementary retrieval strategies to find relevant evidence. Specs: DS009, DS023, DS024.

BM25 Lexical Search

Standard BM25 with per-field weighting. Matching is exact on stemmed tokens: a query term hits a unit only if both reduce to the same stem.

score(query, unit) = Σ fieldWeight[f] × BM25(query, unit[f]) × roleBoost
BM25(q,d) = Σ IDF(t) × (tf × (k1+1)) / (tf + k1 × (1 - b + b × dl/avgdl))
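k1 = 1.2, b = 0.75
where tf is the frequency of term t in d, dl is the length of d in tokens, and avgdl is the average length across the collection.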
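
As a rough illustration of the BM25 term above, here is a minimal Python sketch that scores one tokenized field against a query. The IDF variant (the smoothed, non-negative form) is an assumption, since the formula leaves IDF unspecified; the names `bm25` and `idf` and the list-of-tokens representation are also illustrative rather than taken from the MRP-VM code.

```python
import math
from collections import Counter

K1 = 1.2   # term-frequency saturation, as given above
B = 0.75   # length-normalization strength, as given above

def idf(term, docs):
    """Assumed IDF variant: log(1 + (N - df + 0.5) / (df + 0.5)), never negative."""
    n = len(docs)
    df = sum(1 for d in docs if term in d)
    return math.log(1 + (n - df + 0.5) / (df + 0.5))

def bm25(query, doc, docs):
    """Score one tokenized document (a member of docs) against query tokens."""
    avgdl = sum(len(d) for d in docs) / len(docs)
    tf = Counter(doc)
    score = 0.0
    for t in query:
        f = tf[t]
        if f == 0:
            continue  # terms absent from the document contribute nothing
        norm = f + K1 * (1 - B + B * len(doc) / avgdl)
        score += idf(t, docs) * (f * (K1 + 1)) / norm
    return score

docs = [["cache", "evict"], ["cache", "warm", "cache"], ["index", "rebuild"]]
print(bm25(["evict"], docs[0], docs))
```

The sketch assumes a non-empty collection and that `doc` is one of `docs`; a production version would precompute document frequencies instead of rescanning the collection per term.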

Field Weights

Field	Weight	Why
topic	1.5×	Most discriminative — what the unit is about
claim	1.0×	Core assertion content
procedure	1.0×	Step content (for Procedure role)
utilityActs	0.8×	Pragmatic act matching
utilityNote	0.6×	Supplementary context
condition	0.6×	Constraints and limitations
role	0.5×	Structural role matching
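
The weights in the table combine with the per-field BM25 scores as in the first formula above. A minimal sketch, assuming the per-field BM25 scores have already been computed and are passed in as a dictionary; `unit_score` is a hypothetical name, and the default `role_boost` of 1.0 is an assumption because the text does not define how roleBoost is derived.

```python
# Field weights copied from the table above.
FIELD_WEIGHTS = {
    "topic": 1.5,
    "claim": 1.0,
    "procedure": 1.0,
    "utilityActs": 0.8,
    "utilityNote": 0.6,
    "condition": 0.6,
    "role": 0.5,
}

def unit_score(field_scores, role_boost=1.0):
    """Weighted sum of per-field BM25 scores, scaled by an assumed role boost.

    field_scores maps a field name to its BM25 score for the query.
    Fields missing from FIELD_WEIGHTS are ignored (weight 0).
    """
    total = sum(FIELD_WEIGHTS.get(f, 0.0) * s for f, s in field_scores.items())
    return total * role_boost

print(unit_score({"topic": 2.0, "claim": 1.0}))  # 1.5*2.0 + 1.0*1.0
```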

Tokenization

Lowercase → split on whitespace → strip edge punctuation → remove possessives ('s) → stopword removal → Porter stemming. Hyphenated terms are indexed both whole and as their separate parts.
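
The pipeline above can be sketched in a few lines of Python. This is an approximation under stated assumptions: the stopword list is a tiny illustrative sample, the punctuation set is a guess, and the stemmer is passed in as a function (defaulting to identity) so the example does not depend on any particular Porter implementation.

```python
import re

# Illustrative stopword sample; the real list is not given in the text.
STOPWORDS = frozenset({"the", "a", "an", "of", "and", "to", "in", "is"})
EDGE_PUNCT = ".,;:!?\"'()[]{}"

def tokenize(text, stem=lambda t: t):
    """Lowercase, split, strip edge punctuation, drop 's, drop stopwords, stem.

    stem is a hypothetical hook for a Porter stemmer; identity by default.
    Hyphenated terms are emitted whole and again as their parts.
    """
    out = []
    for raw in text.lower().split():
        tok = raw.strip(EDGE_PUNCT)
        tok = re.sub(r"'s$", "", tok)  # remove possessive suffix
        if not tok:
            continue
        parts = [tok]
        if "-" in tok:
            parts += [p for p in tok.split("-") if p]
        for p in parts:
            if p not in STOPWORDS:
                out.append(stem(p))
    return out

print(tokenize("The cache's write-back policy."))
# ['cache', 'write-back', 'write', 'back', 'policy']
```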

HDC/VSA Associative Search

Hyperdimensional Computing with 4096-bit binary vectors. Complements BM25 by capturing structural similarity when lexical overlap is partial.

Core Operations

Operation	Implementation	Purpose
Random HV	Seeded PRNG from string hash → 4096 bits	Unique vector per token/symbol
Bind (⊗)	Bitwise XOR	Associate field name with value
Bundle (+)	Majority vote per bit	Combine multiple concepts
Similarity	1 − Hamming(a,b) / 4096	Compare vectors (0.50 = random)
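
The four operations in the table can be sketched with Python integers as 4096-bit vectors. The choice of SHA-256 as the string hash, Python's `random.Random` as the seeded PRNG, and the tie-breaking rule in `bundle` are assumptions for illustration; the text only specifies "seeded PRNG from string hash", XOR, majority vote, and normalized Hamming similarity.

```python
import hashlib
import random

DIM = 4096  # vector width in bits, as given above

def random_hv(symbol):
    """Deterministic pseudo-random hypervector for a token or symbol."""
    digest = hashlib.sha256(symbol.encode("utf-8")).digest()
    seed = int.from_bytes(digest[:8], "big")
    return random.Random(seed).getrandbits(DIM)

def bind(a, b):
    """Bind via bitwise XOR; XOR is its own inverse, so bind(bind(a, b), b) == a."""
    return a ^ b

def bundle(vectors):
    """Per-bit majority vote; ties (possible with an even count) resolve to 0 here."""
    n = len(vectors)
    out = 0
    for i in range(DIM):
        ones = sum((v >> i) & 1 for v in vectors)
        if 2 * ones > n:
            out |= 1 << i
    return out

def similarity(a, b):
    """1 - Hamming(a, b) / DIM; about 0.50 for unrelated vectors."""
    return 1 - bin(a ^ b).count("1") / DIM

a, b, c = random_hv("topic"), random_hv("cache"), random_hv("evict")
print(similarity(a, b))                  # close to 0.50
print(similarity(bundle([a, b, c]), a))  # well above 0.50
```

A bundle of three vectors stays measurably similar to each of its members, which is what lets a single field vector be compared against a query that shares only some of its tokens.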

Per-Field Encoding

Each unit is encoded as separate field vectors, not one blob:

role → randomHV(roleName) — exact match on role
topic → encodeNgrams(tokens) — positional unigrams + bigrams
claim → encodeNgrams(tokens) — captures word order
acts → encodeTokens(actList) — bag-of-words
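
The text does not spell out how `encodeNgrams` and `encodeTokens` build their vectors, so the following is only one plausible construction, not the project's implementation. It assumes positions are encoded by XOR-binding a token vector with a position vector, and ordered bigrams by binding distinct "left" and "right" vectors for the two tokens. The helpers `random_hv`, `bundle`, and `similarity` are repeated in compact form to keep the block self-contained.

```python
import hashlib
import random

DIM = 4096

def random_hv(symbol):
    seed = int.from_bytes(hashlib.sha256(symbol.encode("utf-8")).digest()[:8], "big")
    return random.Random(seed).getrandbits(DIM)

def bundle(vectors):
    # Per-bit majority vote; ties resolve to 0 (an assumption).
    n, out = len(vectors), 0
    for i in range(DIM):
        if 2 * sum((v >> i) & 1 for v in vectors) > n:
            out |= 1 << i
    return out

def similarity(a, b):
    return 1 - bin(a ^ b).count("1") / DIM

def encode_tokens(tokens):
    """Bag-of-words: bundle the token vectors; order does not matter."""
    return bundle([random_hv(t) for t in tokens])

def encode_ngrams(tokens):
    """Assumed scheme: positional unigrams plus ordered bigrams, bundled."""
    items = [random_hv(t) ^ random_hv(f"pos:{i}") for i, t in enumerate(tokens)]
    items += [random_hv("L:" + left) ^ random_hv("R:" + right)
              for left, right in zip(tokens, tokens[1:])]
    return bundle(items)

fwd = encode_ngrams(["open", "file"])
rev = encode_ngrams(["file", "open"])
print(similarity(fwd, rev))  # near chance level: word order changes the encoding
```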

Scoring

fieldScore = max(0, (similarity(query.field, unit.field) - 0.50) × 2)
finalScore = 0.35×topic + 0.35×claim + 0.20×role + 0.10×acts

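Subtracting 0.50 removes the chance-level baseline, since two unrelated random vectors agree on about half of their bits (~0.50 similarity). Multiplying by 2 rescales the remainder to the range 0 to 1, and max(0, …) clamps below-chance similarities to zero.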
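
A short sketch of the two scoring formulas above. The function names and the dictionary input are illustrative; treating a missing field as chance-level similarity (0.50) is an assumption.

```python
# Per-field weights from the finalScore formula above.
HDC_WEIGHTS = {"topic": 0.35, "claim": 0.35, "role": 0.20, "acts": 0.10}

def field_score(sim):
    """Rescale similarity so that chance level (0.50) maps to 0 and identity to 1."""
    return max(0.0, (sim - 0.50) * 2)

def final_score(sims):
    """Weighted sum of rescaled per-field similarities.

    sims maps field name to raw similarity in [0, 1]; missing fields are
    treated as chance-level (an assumption) and so contribute nothing.
    """
    return sum(w * field_score(sims.get(f, 0.50)) for f, w in HDC_WEIGHTS.items())

print(field_score(0.75))  # 0.5
print(final_score({"topic": 1.0, "claim": 0.75, "role": 1.0, "acts": 0.5}))
```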

Score Fusion

When both strategies find the same unit:

fusedScore = 1.0 × bm25_normalized + 0.7 × hdc_normalized + 0.15 (agreement bonus)
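
The fusion rule can be sketched as follows. Two points are assumptions because the text does not state them: each strategy's scores are normalized by dividing by that strategy's top score, and a unit found by only one strategy receives just that strategy's weighted term with no agreement bonus. The name `fuse` is hypothetical.

```python
def fuse(bm25, hdc, w_bm25=1.0, w_hdc=0.7, bonus=0.15):
    """Fuse two {unit_id: raw_score} maps into one {unit_id: fused_score} map."""
    def normalize(scores):
        # Assumed normalization: divide by the strategy's best score.
        top = max(scores.values(), default=0.0)
        return {k: (v / top if top > 0 else 0.0) for k, v in scores.items()}

    nb, nh = normalize(bm25), normalize(hdc)
    fused = {}
    for uid in nb.keys() | nh.keys():
        score = w_bm25 * nb.get(uid, 0.0) + w_hdc * nh.get(uid, 0.0)
        if uid in nb and uid in nh:
            score += bonus  # agreement bonus: both strategies found this unit
        fused[uid] = score
    return fused

result = fuse({"u1": 4.0, "u2": 2.0}, {"u1": 0.3, "u3": 0.6})
print(sorted(result.items()))
```

In this example `u1` is found by both strategies and scores 1.0 + 0.7 × 0.5 + 0.15 = 1.5, while `u2` and `u3` score 0.5 and 0.7 respectively.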

Confidence Gap Pruning

After scoring and sorting, candidates with score below topScore × gapThreshold are removed. This prevents low-relevance noise from reaching synthesis.
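
A minimal sketch of the pruning step, assuming candidates are `(unit_id, score)` pairs and the threshold is given as a fraction (0.35 for 35%). The name `prune_by_gap` is illustrative.

```python
def prune_by_gap(candidates, gap_threshold):
    """Sort by score descending and drop anything below topScore * gap_threshold."""
    ranked = sorted(candidates, key=lambda c: c[1], reverse=True)
    if not ranked:
        return []
    cutoff = ranked[0][1] * gap_threshold
    return [c for c in ranked if c[1] >= cutoff]

hits = [("b", 0.4), ("a", 1.0), ("c", 0.2)]
print(prune_by_gap(hits, 0.35))  # [('a', 1.0), ('b', 0.4)]
print(prune_by_gap(hits, 0.50))  # [('a', 1.0)]
```

A higher threshold keeps fewer candidates, which matches the plugin defaults: the precision-oriented plugin uses the largest value.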

KB Plugin Defaults

KB Plugin	Strategies	maxResults	minScore	Gap Threshold	Use Case
kb-fast	BM25 only	3	0.3	50%	Simple questions, max precision
kb-balanced	BM25 + HDC escalation	7	0.15	35%	Default, good tradeoff
kb-thinkingdb	BM25 + bounded symbolic expansion	8	0.12	25%	Multi-hop or relation-sensitive retrieval

Escalation (kb-balanced): HDC runs only when BM25 returns fewer than minAcceptableCandidates.
ThinkingDB (kb-thinkingdb): applies a bounded symbolic closure over normalized facts, as specified by DS025.
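
The defaults table and the escalation rule might be expressed like this. The dataclass, the dictionary, and the function `retrieve_balanced` are hypothetical; the search functions are passed in as callables, and `min_acceptable_candidates` has no default here because the text does not give its value.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class KbDefaults:
    strategies: str
    max_results: int
    min_score: float
    gap_threshold: float  # fraction of the top score

# Values copied from the table above.
KB_DEFAULTS = {
    "kb-fast": KbDefaults("BM25 only", 3, 0.30, 0.50),
    "kb-balanced": KbDefaults("BM25 + HDC escalation", 7, 0.15, 0.35),
    "kb-thinkingdb": KbDefaults("BM25 + bounded symbolic expansion", 8, 0.12, 0.25),
}

def retrieve_balanced(query, bm25_search, hdc_search, min_acceptable_candidates):
    """Run BM25 first; run HDC only if BM25 returned too few candidates.

    Returns (bm25_hits, hdc_hits), where hdc_hits is None when no escalation
    happened. How the two lists are merged afterwards is left to the caller.
    """
    bm25_hits = bm25_search(query)
    if len(bm25_hits) < min_acceptable_candidates:
        return bm25_hits, hdc_search(query)
    return bm25_hits, None

lexical, associative = retrieve_balanced(
    "cache eviction",
    bm25_search=lambda q: ["u1"],
    hdc_search=lambda q: ["u1", "u7"],
    min_acceptable_candidates=2,
)
print(lexical, associative)  # ['u1'] ['u1', 'u7']
```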
