Evaluation System DS021

Suite-based behavioral evaluation that compares planner configurations, plugin combinations, and retrieval quality. The primary evaluation surface is planner-driven routing, not legacy mode×profile combinations.

Suite Structure

test/evaluation/
├── run.mjs              # Runner
├── suite01/
│   ├── story.nl         # Source text
│   └── eval.json        # Questions + expected results
└── suite03/
    ├── story.nl
    └── eval.json

eval.json — Planner-Centric Combos

The preferred way to define evaluation combinations is through pluginCombos:

{
  "suiteId": "suite01",
  "title": "Lumina-7 — Single-Word Answers",
  "storyFile": "story.nl",
  "pluginCombos": [
    { "label": "planner-auto", "plannerPlugin": "planner-default" },
    { "label": "planner-depth", "plannerPlugin": "planner-depth" },
    { "label": "symbolic-fast",
      "seedDetectorPlugin": "sd-symbolic",
      "kbPlugin": "kb-fast",
      "goalSolverPlugin": "gs-symbolic" },
    { "label": "llm-deep-thinkingdb",
      "seedDetectorPlugin": "sd-llm-deep",
      "kbPlugin": "kb-thinkingdb",
      "goalSolverPlugin": "gs-llm-deep" }
  ],
  "questions": [...]
}

When pluginCombos is omitted, the runner falls back to its default combos, led by planner-centric entries that let the meta-rational planner decide plugin ordering. This is the recommended evaluation mode.
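The fallback described above can be pictured with a minimal sketch. The names `DEFAULT_COMBOS` and `resolveCombos` are hypothetical and not taken from run.mjs; treating an empty array the same as an omitted one is also an assumption. The `llm-fast-balanced` combo is left out because its plugin ids are not stated in this document.

```javascript
// Hypothetical default list; labels and plugin ids are copied from this page.
// llm-fast-balanced is omitted because its plugin ids are not documented here.
const DEFAULT_COMBOS = [
  { label: "planner-default (auto)", plannerPlugin: "planner-default" },
  { label: "planner-depth (auto)", plannerPlugin: "planner-depth" },
  {
    label: "symbolic-fast",
    seedDetectorPlugin: "sd-symbolic",
    kbPlugin: "kb-fast",
    goalSolverPlugin: "gs-symbolic",
  },
  {
    label: "llm-deep-thinkingdb",
    seedDetectorPlugin: "sd-llm-deep",
    kbPlugin: "kb-thinkingdb",
    goalSolverPlugin: "gs-llm-deep",
  },
];

// Use the suite's own pluginCombos when present, otherwise the defaults.
// Assumption: an empty array counts as "omitted".
function resolveCombos(evalConfig) {
  const combos = evalConfig.pluginCombos;
  if (Array.isArray(combos) && combos.length > 0) {
    return combos;
  }
  return DEFAULT_COMBOS;
}

console.log(resolveCombos({ suiteId: "suite01" }).map((combo) => combo.label));
```

A suite that wants only the planner-driven runs would list just those two entries in pluginCombos; under this sketch the pinned combos then never execute for that suite.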

Default Evaluation Combos

Label	What it tests
`planner-default (auto)`	Let the adaptive cheap-first planner decide everything
`planner-depth (auto)`	Let the heavy-first planner decide everything
`symbolic-fast`	Pinned: symbolic seed + fast KB + symbolic goal
`llm-fast-balanced`	Pinned: LLM-fast seed + balanced KB + LLM-fast goal
`llm-deep-thinkingdb`	Pinned: LLM-deep seed + thinkingdb KB + LLM-deep goal

The first two combos are the most important — they show how well the planner routes requests without human intervention.

What Gets Checked

Check	Category	What it verifies
Intent matching	ANS	At least one intent group mentions expected terms
Context recall	CTX	Retrieved context contains expected terms
Context precision	CTX	Retrieved context does NOT contain unwanted terms
Answer content	ANS	Answer contains expected terms
Answer noise	ANS	Answer does NOT contain unwanted terms
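As a rough illustration of the checks in the table, the sketch below applies an "expected terms present, unwanted terms absent" test to a piece of text. Everything here is hypothetical: the function names, the case-insensitive substring rule, and the reading of intent matching as "some group mentions every expected term" are assumptions, not the runner's documented behavior.

```javascript
// Assumption: a term is "mentioned" when it appears as a case-insensitive substring.
function mentions(text, term) {
  return text.toLowerCase().includes(term.toLowerCase());
}

// Covers the recall/content check (expected terms) and the precision/noise
// check (unwanted terms) for one text: an answer or a retrieved context.
function checkText(text, expectedTerms = [], unwantedTerms = []) {
  const missing = expectedTerms.filter((term) => !mentions(text, term));
  const leaked = unwantedTerms.filter((term) => mentions(text, term));
  return { pass: missing.length === 0 && leaked.length === 0, missing, leaked };
}

// Intent matching, read here as: at least one intent group mentions
// every expected term. Intent groups are modeled as plain strings.
function checkIntents(intentGroups, expectedTerms = []) {
  return intentGroups.some((group) =>
    expectedTerms.every((term) => mentions(group, term))
  );
}

console.log(checkText("alpha beta", ["Alpha"], ["gamma"]));
console.log(checkIntents(["find alpha", "list beta gamma"], ["beta", "gamma"]));
```

Returning the `missing` and `leaked` lists alongside the boolean makes a failed question easier to diagnose than a bare pass/fail flag.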

Metrics

Answer pass rate: share of questions where the expected intents are matched, required mentions are present, and forbidden mentions are absent
Context pass rate: share of questions where the expected context is present and unwanted context is absent
Context F1: harmonic mean of context recall and precision, computed per question and averaged per suite
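The Context F1 metric can be sketched as follows. The function names and the `{ recall, precision }` shape are hypothetical, and how the runner derives per-question recall and precision is not specified here, so the sketch takes both as inputs in the range 0 to 1. Returning 0 when both scores are 0, and for an empty suite, is also an assumption.

```javascript
// Harmonic mean of recall and precision: 2 * r * p / (r + p).
// Assumption: defined as 0 when both inputs are 0, to avoid dividing by zero.
function f1(recall, precision) {
  const sum = recall + precision;
  return sum === 0 ? 0 : (2 * recall * precision) / sum;
}

// Per-question F1, averaged over the suite.
// Assumption: an empty suite scores 0.
function suiteContextF1(questionScores) {
  if (questionScores.length === 0) {
    return 0;
  }
  const total = questionScores.reduce(
    (acc, score) => acc + f1(score.recall, score.precision),
    0
  );
  return total / questionScores.length;
}

console.log(
  suiteContextF1([
    { recall: 1, precision: 1 },
    { recall: 0, precision: 1 },
  ])
); // prints 0.5
```

Averaging per-question F1 values differs from computing one F1 over suite-wide recall and precision: a question with zero recall contributes a flat 0 regardless of how clean its context is.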

CLI

# Default: planner-centric combos
node test/evaluation/run.mjs

# Filter by planner
node test/evaluation/run.mjs --planner-plugin planner-depth

# Filter by specific plugins
node test/evaluation/run.mjs --kb-plugin kb-thinkingdb

# Legacy mode×profile (only when eval.json specifies modes/profiles)
node test/evaluation/run.mjs --mode symbolic-only --profile fast

# Single suite
node test/evaluation/run.mjs --suite suite01

Legacy Compatibility

If eval.json specifies modes and/or profiles arrays, the runner still expands them into mode×profile combinations for backward compatibility. The recommended approach remains pluginCombos or the default combos.
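The mode×profile expansion amounts to a cross product, sketched below. The function name `expandLegacy`, the `fallback` argument used when only one of the two arrays is given, and the label format are hypothetical. Only `symbolic-only` and `fast` appear in this document; `mode-b` and `profile-b` are placeholders.

```javascript
// Build one combo per (mode, profile) pair.
// Assumption: a missing array is filled from a caller-supplied fallback.
function expandLegacy(evalConfig, fallback) {
  const modes = evalConfig.modes ?? fallback.modes;
  const profiles = evalConfig.profiles ?? fallback.profiles;
  return modes.flatMap((mode) =>
    profiles.map((profile) => ({
      label: `${mode}×${profile}`, // hypothetical label format
      mode,
      profile,
    }))
  );
}

const legacyCombos = expandLegacy(
  { modes: ["symbolic-only", "mode-b"] }, // "mode-b" is a placeholder
  { modes: ["symbolic-only"], profiles: ["fast", "profile-b"] } // placeholder fallback
);
console.log(legacyCombos.map((combo) => combo.label));
```

Because the number of runs grows as modes times profiles, each array added to a legacy suite multiplies evaluation time, which is one practical reason to prefer an explicit pluginCombos list.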
