Retrieval Algorithms: BM25 & HDC/VSA (status: IMPLEMENTED)

MRP-VM uses two complementary retrieval strategies to find relevant evidence. Specs: DS009, DS023, DS024.

Pipeline: Query → BM25 Lexical + HDC/VSA Associative → Score Fusion → Gap Pruning & Top-N

BM25 Lexical Search

Standard BM25 with per-field weighting. Matches exact tokens after stemming.

score(query, unit) = Σ fieldWeight[f] × BM25(query, unit[f]) × roleBoost
BM25(q,d) = Σ IDF(t) × (tf × (k1+1)) / (tf + k1 × (1 - b + b × dl/avgdl))
k1 = 1.2, b = 0.75
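
The formulas above can be sketched in Python as follows. This is a minimal illustration, not the MRP-VM implementation: the IDF variant, the representation of a unit as a dict of token lists, and the names `idf`, `bm25`, and `score_unit` are assumptions. The field weights are the ones listed under Field Weights in this section.

```python
import math
from collections import Counter

K1 = 1.2
B = 0.75

# Field weights as listed in this section.
FIELD_WEIGHTS = {
    "topic": 1.5, "claim": 1.0, "procedure": 1.0,
    "utilityActs": 0.8, "utilityNote": 0.6, "condition": 0.6, "role": 0.5,
}

def idf(term, docs):
    """Assumed IDF variant: ln(1 + (N - n + 0.5) / (n + 0.5))."""
    n = sum(1 for d in docs if term in d)
    return math.log(1 + (len(docs) - n + 0.5) / (n + 0.5))

def bm25(query, doc, docs):
    """BM25 of a query token list against one field's token list.

    `docs` holds the same field's token lists for every unit in the corpus.
    """
    if not doc:
        return 0.0
    avgdl = sum(len(d) for d in docs) / len(docs) or 1.0
    tf = Counter(doc)
    score = 0.0
    for t in query:
        f = tf[t]
        if f == 0:
            continue
        norm = f + K1 * (1 - B + B * len(doc) / avgdl)
        score += idf(t, docs) * (f * (K1 + 1)) / norm
    return score

def score_unit(query, unit, units, role_boost=1.0):
    """Weighted sum of per-field BM25 scores, scaled by roleBoost."""
    total = 0.0
    for field, weight in FIELD_WEIGHTS.items():
        docs = [u.get(field, []) for u in units]
        total += weight * bm25(query, unit.get(field, []), docs)
    return total * role_boost
```

Because roleBoost is constant for a given unit, applying it once after the sum is equivalent to multiplying every term by it.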

Field Weights

Field       | Weight | Why
topic       | 1.5×   | Most discriminative: what the unit is about
claim       | 1.0×   | Core assertion content
procedure   | 1.0×   | Step content (for Procedure role)
utilityActs | 0.8×   | Pragmatic act matching
utilityNote | 0.6×   | Supplementary context
condition   | 0.6×   | Constraints and limitations
role        | 0.5×   | Structural role matching

Tokenization

Lowercase → split on whitespace → strip edge punctuation → remove possessives ('s) → stopword removal → Porter stemming. Hyphenated terms: keep whole + index parts separately.
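
A rough sketch of this pipeline in Python follows. The stopword list is a tiny illustrative sample, and `stem` is an identity placeholder standing in for a real Porter stemmer; both are assumptions, as is the choice to apply stopword removal to the parts of a hyphenated term.

```python
import string

# Illustrative sample only; the real stopword list is not given in this section.
STOPWORDS = {"the", "a", "an", "of", "is", "and", "to", "in"}

def stem(token):
    """Placeholder for Porter stemming (identity here, by assumption)."""
    return token

def tokenize(text):
    tokens = []
    for raw in text.lower().split():
        # Strip punctuation from both edges, then drop a trailing possessive.
        word = raw.strip(string.punctuation)
        if word.endswith(("'s", "\u2019s")):
            word = word[:-2]
        if not word:
            continue
        # Hyphenated terms: keep the whole term and index its parts as well.
        candidates = [word]
        if "-" in word:
            candidates += [part for part in word.split("-") if part]
        for c in candidates:
            if c not in STOPWORDS:
                tokens.append(stem(c))
    return tokens
```

For example, `tokenize("The User's state-of-the-art cache!")` yields `["user", "state-of-the-art", "state", "art", "cache"]` with the sample stopword list above.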

HDC/VSA Associative Search

Hyperdimensional Computing with 4096-bit binary vectors. Complements BM25 by capturing structural similarity when lexical overlap is partial.

Core Operations

Operation  | Implementation                           | Purpose
Random HV  | Seeded PRNG from string hash → 4096 bits | Unique vector per token/symbol
Bind (⊗)   | Bitwise XOR                              | Associate field name with value
Bundle (+) | Majority vote per bit                    | Combine multiple concepts
Similarity | 1 − Hamming(a,b) / 4096                  | Compare vectors (0.50 = random)
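
The four operations can be sketched with Python integers used as 4096-bit vectors. This is an illustration under stated assumptions: SHA-256 as the string hash, Python's `random.Random` as the seeded PRNG, and ties in the majority vote resolving to 0 are choices made here, not details given by the source.

```python
import hashlib
import random

DIM = 4096

def random_hv(symbol):
    """Deterministic 4096-bit vector for a symbol (hash and PRNG are assumed)."""
    digest = hashlib.sha256(symbol.encode("utf-8")).digest()
    seed = int.from_bytes(digest[:8], "big")
    return random.Random(seed).getrandbits(DIM)

def bind(a, b):
    """Bind: bitwise XOR. XOR is its own inverse, so bind(bind(a, b), b) == a."""
    return a ^ b

def bundle(vectors):
    """Bundle: per-bit majority vote. Ties resolve to 0 (assumption)."""
    result = 0
    half = len(vectors) / 2
    for bit in range(DIM):
        ones = sum((v >> bit) & 1 for v in vectors)
        if ones > half:
            result |= 1 << bit
    return result

def similarity(a, b):
    """1 - Hamming(a, b) / 4096; about 0.50 for unrelated random vectors."""
    return 1 - bin(a ^ b).count("1") / DIM
```

A bundle stays measurably similar to each of its inputs, which is what lets a bundled vector stand for several concepts at once.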

Per-Field Encoding

Each unit is encoded as a set of separate field vectors rather than one combined vector, so each field can be compared and weighted independently.

Scoring

fieldScore = max(0, (similarity(query.field, unit.field) - 0.50) × 2)
finalScore = 0.35×topic + 0.35×claim + 0.20×role + 0.10×acts

The 0.50 subtraction removes random noise (two random vectors have ~0.50 similarity).
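
As a worked sketch of the two formulas above, the following Python rescales each field similarity and combines the results. Treating a missing field as similarity 0.50 (so it contributes nothing) is an assumption, as are the function names.

```python
# Field weights from the finalScore formula in this section.
WEIGHTS = {"topic": 0.35, "claim": 0.35, "role": 0.20, "acts": 0.10}

def field_score(sim):
    """Map similarity in [0.50, 1.0] onto [0, 1]; anything at or below 0.50 becomes 0."""
    return max(0.0, (sim - 0.50) * 2)

def final_score(similarities):
    """Weighted sum of rescaled field similarities.

    A field missing from `similarities` is treated as 0.50 (assumption).
    """
    return sum(w * field_score(similarities.get(f, 0.50)) for f, w in WEIGHTS.items())
```

For instance, similarities of 0.80 (topic), 0.65 (claim), 1.00 (role), and 0.50 (acts) give field scores of 0.6, 0.3, 1.0, and 0.0, so finalScore = 0.21 + 0.105 + 0.20 + 0 = 0.515.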

Score Fusion

When both strategies find the same unit:

fusedScore = 1.0 × bm25_normalized + 0.7 × hdc_normalized + 0.15 (agreement bonus)
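
A minimal sketch of the fusion step follows. The source only specifies the case where both strategies find the same unit; here a unit found by a single strategy is assumed to keep its weighted score with no bonus. The inputs are assumed to be already normalized, since the normalization method is not described in this section.

```python
def fuse(bm25_norm, hdc_norm, bm25_weight=1.0, hdc_weight=0.7, agreement_bonus=0.15):
    """Fuse two dicts of unit_id -> normalized score into one dict.

    The agreement bonus is added only when both strategies returned the unit.
    """
    fused = {}
    for uid in set(bm25_norm) | set(hdc_norm):
        score = bm25_weight * bm25_norm.get(uid, 0.0) + hdc_weight * hdc_norm.get(uid, 0.0)
        if uid in bm25_norm and uid in hdc_norm:
            score += agreement_bonus
        fused[uid] = score
    return fused
```

With `fuse({"a": 1.0, "b": 0.5}, {"a": 0.5, "c": 1.0})`, unit `a` scores 1.0 + 0.35 + 0.15 = 1.5, while `b` and `c` score 0.5 and 0.7 without the bonus.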

Confidence Gap Pruning

After scoring and sorting, candidates with score below topScore × gapThreshold are removed. This prevents low-relevance noise from reaching synthesis.
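
The pruning rule can be sketched as below. The `min_score` and `max_results` parameters mirror the minScore and maxResults columns of the KB plugin table; the order in which they are applied relative to the gap cutoff is an assumption.

```python
def prune(scored, gap_threshold, max_results, min_score=0.0):
    """Sort (unit_id, score) pairs and drop those below topScore * gapThreshold."""
    ranked = sorted(scored, key=lambda pair: pair[1], reverse=True)
    if not ranked:
        return []
    cutoff = ranked[0][1] * gap_threshold
    kept = [(uid, s) for uid, s in ranked if s >= cutoff and s >= min_score]
    return kept[:max_results]
```

With scores 1.0, 0.6, 0.3, and 0.2 and a gap threshold of 0.35, the cutoff is 0.35, so only the first two candidates survive.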

KB Plugin Defaults

KB Plugin     | Strategies                        | maxResults | minScore | Gap Threshold | Use Case
kb-fast       | BM25 only                         | 3          | 0.3      | 50%           | Simple questions, max precision
kb-balanced   | BM25 + HDC escalation             | 7          | 0.15     | 35%           | Default, good tradeoff
kb-thinkingdb | BM25 + bounded symbolic expansion | 8          | 0.12     | 25%           | Multi-hop or relation-sensitive retrieval

Escalation (balanced): HDC runs only when BM25 returns fewer than minAcceptableCandidates.
ThinkingDB: bounded symbolic closure over normalized facts as specified by DS025.
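
The escalation rule for the balanced plugin might look like the sketch below. `bm25_search` and `hdc_search` are hypothetical callables that take a query and return a list of (unit_id, score) pairs; they are placeholders, not MRP-VM APIs.

```python
def retrieve_balanced(query, bm25_search, hdc_search, min_acceptable_candidates):
    """Run BM25 first; run HDC only if BM25 returned too few candidates.

    Returns (bm25_hits, hdc_hits); hdc_hits is empty when no escalation happened.
    """
    bm25_hits = bm25_search(query)
    if len(bm25_hits) >= min_acceptable_candidates:
        return bm25_hits, []
    return bm25_hits, hdc_search(query)
```

Keeping the HDC call behind this check means the more expensive associative search is skipped whenever lexical matching alone yields enough candidates.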