Suite-based behavioral evaluation that compares planner configurations, plugin combinations, and retrieval quality. The primary evaluation surface is planner-driven routing, not legacy mode×profile combinations.
test/evaluation/
├── run.mjs              # Runner
├── suite01/
│   ├── story.nl         # Source text
│   └── eval.json        # Questions + expected results
└── suite03/
    ├── story.nl
    └── eval.json
The preferred way to define evaluation combinations is through pluginCombos:
{
  "suiteId": "suite01",
  "title": "Lumina-7 — Single-Word Answers",
  "storyFile": "story.nl",
  "pluginCombos": [
    { "label": "planner-auto", "plannerPlugin": "planner-default" },
    { "label": "planner-depth", "plannerPlugin": "planner-depth" },
    { "label": "symbolic-fast",
      "seedDetectorPlugin": "sd-symbolic",
      "kbPlugin": "kb-fast",
      "goalSolverPlugin": "gs-symbolic" },
    { "label": "llm-deep-thinkingdb",
      "seedDetectorPlugin": "sd-llm-deep",
      "kbPlugin": "kb-thinkingdb",
      "goalSolverPlugin": "gs-llm-deep" }
  ],
  "questions": [...]
}
When pluginCombos is omitted, the runner uses default planner-centric combos that let the meta-rational planner decide plugin ordering. This is the recommended evaluation mode.
| Label | What it tests |
|---|---|
| planner-default (auto) | Let the adaptive cheap-first planner decide everything |
| planner-depth (auto) | Let the heavy-first planner decide everything |
| symbolic-fast | Pinned: symbolic seed + fast KB + symbolic goal |
| llm-fast-balanced | Pinned: LLM-fast seed + balanced KB + LLM-fast goal |
| llm-deep-thinkingdb | Pinned: LLM-deep seed + thinkingdb KB + LLM-deep goal |
The first two combos are the most important: they show how well the planner routes requests without human intervention.
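The fallback described above (use pluginCombos when eval.json provides it, otherwise run the defaults) can be sketched roughly as follows. This is a minimal illustration, not the runner's actual code: `resolveCombos` and `DEFAULT_COMBOS` are hypothetical names, and only the two planner-driven defaults from the table are listed.

```javascript
// Hypothetical default combos. The labels and plugin names mirror the table
// above; the real runner may define more defaults or define them differently.
const DEFAULT_COMBOS = [
  { label: "planner-default", plannerPlugin: "planner-default" },
  { label: "planner-depth", plannerPlugin: "planner-depth" },
];

// Hypothetical helper: choose the combos to run for one parsed eval.json.
// Assumption: any array under "pluginCombos" wins; anything else falls back.
function resolveCombos(evalConfig, defaults = DEFAULT_COMBOS) {
  if (Array.isArray(evalConfig.pluginCombos)) {
    return evalConfig.pluginCombos;
  }
  return defaults;
}

// A suite without pluginCombos gets the planner-driven defaults.
const combos = resolveCombos({ suiteId: "suite01", storyFile: "story.nl" });
console.log(combos.map((combo) => combo.label));
```

Legacy modes/profiles handling is left out of this sketch on purpose.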
| Check | Category | What it verifies |
|---|---|---|
| Intent matching | ANS | At least one intent group mentions expected terms |
| Context recall | CTX | Retrieved context contains expected terms |
| Context precision | CTX | Retrieved context does NOT contain unwanted terms |
| Answer content | ANS | Answer contains expected terms |
| Answer noise | ANS | Answer does NOT contain unwanted terms |
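The five checks in the table reduce to "contains expected terms" and "does not contain unwanted terms" tests. The sketch below shows one way they could be implemented. It assumes case-insensitive substring matching and that all expected terms must be present; the function names and the field names on `result` and `expected` are hypothetical and are not taken from the runner or from eval.json.

```javascript
// Case-insensitive substring test (assumed matching rule).
function containsTerm(text, term) {
  return text.toLowerCase().includes(term.toLowerCase());
}

// Recall/content-style check: every expected term appears in the text.
function containsAll(text, terms = []) {
  return terms.every((term) => containsTerm(text, term));
}

// Precision/noise-style check: no unwanted term appears in the text.
function containsNone(text, terms = []) {
  return !terms.some((term) => containsTerm(text, term));
}

// Intent matching: at least one intent group mentions the expected terms.
function intentMatches(intentGroups = [], terms = []) {
  return intentGroups.some((group) => containsAll(group, terms));
}

// Run all five checks for one question and return one boolean per check.
function runChecks(result, expected) {
  return {
    intentMatching: intentMatches(result.intentGroups, expected.intentTerms),
    contextRecall: containsAll(result.context, expected.contextTerms),
    contextPrecision: containsNone(result.context, expected.contextUnwanted),
    answerContent: containsAll(result.answer, expected.answerTerms),
    answerNoise: containsNone(result.answer, expected.answerUnwanted),
  };
}

// Made-up data for illustration only; not taken from any real suite.
const report = runChecks(
  { intentGroups: ["where is the lab"], context: "The lab is in orbit.", answer: "orbit" },
  { intentTerms: ["lab"], contextTerms: ["orbit"], contextUnwanted: ["ocean"],
    answerTerms: ["orbit"], answerUnwanted: ["ocean"] }
);
console.log(report);
```

Keeping the recall and precision checks as separate booleans makes it visible whether a failure comes from missing context or from noisy context.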
# Default: planner-centric combos
node test/evaluation/run.mjs
# Filter by planner
node test/evaluation/run.mjs --planner-plugin planner-depth
# Filter by specific plugins
node test/evaluation/run.mjs --kb-plugin kb-thinkingdb
# Legacy mode×profile (only when eval.json specifies modes/profiles)
node test/evaluation/run.mjs --mode symbolic-only --profile fast
# Single suite
node test/evaluation/run.mjs --suite suite01
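The plugin filter flags shown above presumably narrow the combo list to entries that use the named plugin. A minimal sketch of that idea follows, assuming exact-match filtering. `filterCombos` and `FLAG_TO_FIELD` are hypothetical names; the runner's real flag parsing and matching rules may differ.

```javascript
// Hypothetical mapping from CLI flag names to combo fields.
const FLAG_TO_FIELD = new Map([
  ["planner-plugin", "plannerPlugin"],
  ["kb-plugin", "kbPlugin"],
]);

// Keep only the combos whose fields equal every requested flag value.
// Flags that are not in the mapping are ignored in this sketch.
function filterCombos(combos, flags) {
  return combos.filter((combo) =>
    Object.entries(flags).every(([flag, wanted]) => {
      const field = FLAG_TO_FIELD.get(flag);
      return field === undefined || combo[field] === wanted;
    })
  );
}

const allCombos = [
  { label: "planner-depth", plannerPlugin: "planner-depth" },
  { label: "symbolic-fast", kbPlugin: "kb-fast" },
  { label: "llm-deep-thinkingdb", kbPlugin: "kb-thinkingdb" },
];

// Roughly what `--kb-plugin kb-thinkingdb` would select under these assumptions.
console.log(filterCombos(allCombos, { "kb-plugin": "kb-thinkingdb" }));
```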
If eval.json specifies modes and/or profiles arrays, the runner still expands them into mode×profile combinations for backward compatibility. The recommended approach, however, is to define pluginCombos or let the defaults run.
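The legacy expansion is essentially a Cartesian product of the two arrays. The sketch below illustrates it under stated assumptions: `expandLegacyCombos` is a hypothetical name, the label format is invented, and treating a missing array as a single unset axis is a guess at what "and/or" means here.

```javascript
// Expand modes × profiles into one combo per pair.
// Assumption: if only one array is given, the other axis is left unset (null).
function expandLegacyCombos(modes = [], profiles = []) {
  if (modes.length === 0 && profiles.length === 0) {
    return [];
  }
  const modeAxis = modes.length > 0 ? modes : [null];
  const profileAxis = profiles.length > 0 ? profiles : [null];

  const combos = [];
  for (const mode of modeAxis) {
    for (const profile of profileAxis) {
      // Label format is illustrative only.
      const label = [mode, profile].filter((part) => part !== null).join("×");
      combos.push({ label, mode, profile });
    }
  }
  return combos;
}

// One mode and one profile produce a single combination.
console.log(expandLegacyCombos(["symbolic-only"], ["fast"]));
```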