Evaluation System DS021

Suite-based behavioral evaluation that compares planner configurations, plugin combinations, and retrieval quality. The primary evaluation surface is planner-driven routing, not legacy mode×profile combinations.

Suite Structure

test/evaluation/
├── run.mjs              # Runner
├── suite01/
│   ├── story.nl         # Source text
│   └── eval.json        # Questions + expected results
└── suite03/
    ├── story.nl
    └── eval.json

eval.json — Planner-Centric Combos

The preferred way to define evaluation combinations is through pluginCombos:

{
  "suiteId": "suite01",
  "title": "Lumina-7 — Single-Word Answers",
  "storyFile": "story.nl",
  "pluginCombos": [
    { "label": "planner-auto", "plannerPlugin": "planner-default" },
    { "label": "planner-depth", "plannerPlugin": "planner-depth" },
    { "label": "symbolic-fast",
      "seedDetectorPlugin": "sd-symbolic",
      "kbPlugin": "kb-fast",
      "goalSolverPlugin": "gs-symbolic" },
    { "label": "llm-deep-thinkingdb",
      "seedDetectorPlugin": "sd-llm-deep",
      "kbPlugin": "kb-thinkingdb",
      "goalSolverPlugin": "gs-llm-deep" }
  ],
  "questions": [...]
}

When pluginCombos is omitted, the runner uses default planner-centric combos that let the meta-rational planner decide plugin ordering. This is the recommended evaluation mode.
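
The fallback described above can be sketched roughly as follows. This is a minimal illustration, not the runner's actual code: the function name resolveCombos and the constant DEFAULT_COMBOS are hypothetical, the default list only includes entries whose plugin ids appear in this document, and treating an empty array the same as an omitted field is an assumption.

```javascript
// Hypothetical default list. Only combos whose plugin ids are spelled out
// in this document are included; the real runner may define more.
const DEFAULT_COMBOS = [
  { label: "planner-default", plannerPlugin: "planner-default" },
  { label: "planner-depth", plannerPlugin: "planner-depth" },
  {
    label: "symbolic-fast",
    seedDetectorPlugin: "sd-symbolic",
    kbPlugin: "kb-fast",
    goalSolverPlugin: "gs-symbolic",
  },
];

// Return the suite's own pluginCombos when present, otherwise the defaults.
// Assumption: an empty array is treated like a missing field.
function resolveCombos(evalConfig) {
  const combos = evalConfig.pluginCombos;
  if (Array.isArray(combos) && combos.length > 0) {
    return combos;
  }
  return DEFAULT_COMBOS;
}
```

A suite that pins a single combo would get exactly that combo back, while a suite with no pluginCombos field would run the planner-led defaults.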

Default Evaluation Combos

Label                     What it tests
planner-default (auto)    Let the adaptive cheap-first planner decide everything
planner-depth (auto)      Let the heavy-first planner decide everything
symbolic-fast             Pinned: symbolic seed + fast KB + symbolic goal
llm-fast-balanced         Pinned: LLM-fast seed + balanced KB + LLM-fast goal
llm-deep-thinkingdb       Pinned: LLM-deep seed + thinkingdb KB + LLM-deep goal

The first two combos are the most important: they show how well the planner routes requests without human intervention.

What Gets Checked

Check                Category    What it verifies
Intent matching      ANS         At least one intent group mentions expected terms
Context recall       CTX         Retrieved context contains expected terms
Context precision    CTX         Retrieved context does NOT contain unwanted terms
Answer content       ANS         Answer contains expected terms
Answer noise         ANS         Answer does NOT contain unwanted terms
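
To make the five checks concrete, here is a minimal sketch of how they might be computed for one question. All field names (intents, context, answer, intentTerms, contextTerms, contextUnwanted, answerTerms, answerUnwanted) are hypothetical and not taken from eval.json. Matching is assumed to be a case-insensitive substring test, and "mentions expected terms" is read as "a single intent group mentions all of them"; the real runner may differ on both points.

```javascript
// Case-insensitive substring match. Assumption: the real matching rule may differ.
function mentions(text, term) {
  return text.toLowerCase().includes(term.toLowerCase());
}

// Evaluate the five checks for one question.
// `result` and `expected` use hypothetical field names chosen for illustration.
function checkQuestion(result, expected) {
  const { intents = [], context = "", answer = "" } = result;
  const {
    intentTerms = [],
    contextTerms = [],
    contextUnwanted = [],
    answerTerms = [],
    answerUnwanted = [],
  } = expected;

  return {
    // ANS: at least one intent group mentions every expected term.
    intentMatching:
      intentTerms.length === 0 ||
      intents.some((group) => intentTerms.every((t) => mentions(group, t))),
    // CTX: retrieved context contains every expected term.
    contextRecall: contextTerms.every((t) => mentions(context, t)),
    // CTX: retrieved context contains none of the unwanted terms.
    contextPrecision: !contextUnwanted.some((t) => mentions(context, t)),
    // ANS: answer contains every expected term.
    answerContent: answerTerms.every((t) => mentions(answer, t)),
    // ANS: answer contains none of the unwanted terms.
    answerNoise: !answerUnwanted.some((t) => mentions(answer, t)),
  };
}
```

Keeping each check as a separate boolean makes it straightforward to aggregate per category (ANS or CTX) afterwards.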

Metrics

CLI

# Default: planner-centric combos
node test/evaluation/run.mjs

# Filter by planner
node test/evaluation/run.mjs --planner-plugin planner-depth

# Filter by specific plugins
node test/evaluation/run.mjs --kb-plugin kb-thinkingdb

# Legacy mode×profile (only when eval.json specifies modes/profiles)
node test/evaluation/run.mjs --mode symbolic-only --profile fast

# Single suite
node test/evaluation/run.mjs --suite suite01
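
As a rough illustration of what the plugin filters do, the sketch below keeps only the combos whose fields equal the requested values. The function name filterCombos and the camel-cased filter keys are hypothetical. Whether the actual runner drops combos that lack the filtered field (as this sketch does) is an assumption.

```javascript
// Keep combos whose fields match every filter that was actually supplied.
// Filters left undefined are ignored. Assumption: a combo without the
// filtered field does not match.
function filterCombos(combos, filters) {
  const active = Object.entries(filters).filter(([, value]) => value !== undefined);
  return combos.filter((combo) =>
    active.every(([field, value]) => combo[field] === value)
  );
}

// Example: the equivalent of `--kb-plugin kb-thinkingdb`.
const exampleCombos = [
  { label: "planner-auto", plannerPlugin: "planner-default" },
  { label: "symbolic-fast", seedDetectorPlugin: "sd-symbolic", kbPlugin: "kb-fast", goalSolverPlugin: "gs-symbolic" },
  { label: "llm-deep-thinkingdb", seedDetectorPlugin: "sd-llm-deep", kbPlugin: "kb-thinkingdb", goalSolverPlugin: "gs-llm-deep" },
];
const onlyThinkingDb = filterCombos(exampleCombos, { kbPlugin: "kb-thinkingdb" });
console.log(onlyThinkingDb.map((combo) => combo.label));
```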

Legacy Compatibility

If eval.json specifies modes and/or profiles arrays, the runner still expands them into mode×profile combinations for backward compatibility. The recommended approach remains pluginCombos, or omitting both and letting the defaults run.
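
The legacy expansion is a Cartesian product of the two arrays. A minimal sketch, assuming a hypothetical expandLegacyCombos helper and assuming that a missing axis contributes a single null entry (the runner's real fallback for a missing modes or profiles array is not documented here):

```javascript
// Expand legacy modes/profiles arrays into mode×profile combinations.
// Assumption: when only one axis is given, the other is represented by null.
function expandLegacyCombos({ modes, profiles }) {
  if (!modes?.length && !profiles?.length) {
    return []; // nothing legacy to expand
  }
  const modeAxis = modes?.length ? modes : [null];
  const profileAxis = profiles?.length ? profiles : [null];

  return modeAxis.flatMap((mode) =>
    profileAxis.map((profile) => ({
      // Label joins whichever parts are present, e.g. "symbolic-only×fast".
      label: [mode, profile].filter(Boolean).join("×"),
      mode,
      profile,
    }))
  );
}

// One mode and one profile, taken from the CLI example, yield one combination.
console.log(expandLegacyCombos({ modes: ["symbolic-only"], profiles: ["fast"] }));
```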