Nautilus-Compass v1.1.0 shipped 12 hours after v1.0.0, fixing a recall consumption failure caught 5 hours post-launch. The bug: Claude Code agents saw file titles from memory recall but did not read the file bodies, so they repeated the mistakes those files were written to prevent.
Key facts
- v1.1.0 shipped 12 hours after v1.0.0.
- Top-3 hits embed first 800 characters of body.
- 35 negative anchors tracked by drift detector.
- Tested on 130MB session: 41 recall hits, 0 consumed.
- Reproduction cost: $3.50 end-to-end.
The Recall Consumption Bug
According to the Compass v1.1.0 release post, a sister Claude Code dialog was supposed to publish a long-form article to WeChat using a 6-step quality pipeline documented in cross-session memory. Compass recall fired correctly — the file appeared in the agent's UserPromptSubmit hook output. But the agent saw the title and 80-character description, then acted. It did not Read the file body. The actual rules — how to walk audit-gate, which wxid, what xhs-cards-embed structure looks like — never entered the agent's working context.
The agent then reproduced exactly the failure mode the file was written to prevent: ad-hoc _tmp_publish_v8.cjs scripts, no critic round, wrong login path. The user's diagnosis was sharp (translated from Chinese): "Compass recalled it · I didn't consume it · this is persona drift at the agent layer · not a failure of Compass itself." The release post notes this is structural: returning title + 120-char description made it easy to skim and assume you had read the file when you had only read the index.
Three-Layer Fix
v0 — Embed body in top-3 hits. Top-3 recall hits now embed the first 800 characters of post-frontmatter body in an indented │ block. The agent gets the rules in its working context without an additional Read tool call. Tail hits 4..K stay header-only to keep the response bounded at ~3KB total.
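The v0 behaviour can be sketched roughly as follows. This is a minimal illustration, not Compass's actual code: the hit dictionary keys (title, description, text) and the function names strip_frontmatter and render_hits are hypothetical, and the release post's ~3KB response cap is not enforced here.

```python
def strip_frontmatter(text: str) -> str:
    """Drop a leading frontmatter block delimited by '---' lines, if present."""
    lines = text.splitlines()
    if lines and lines[0].strip() == "---":
        for i in range(1, len(lines)):
            if lines[i].strip() == "---":
                return "\n".join(lines[i + 1:]).lstrip("\n")
    return text  # no (closed) frontmatter found: treat everything as body


def render_hits(hits, top_n=3, body_chars=800):
    """Render recall hits; only the first top_n carry an embedded body excerpt.

    Each hit is assumed (hypothetically) to be a dict with 'title',
    'description' and 'text' (the raw memory file contents).
    """
    out = []
    for rank, hit in enumerate(hits, start=1):
        out.append(f"{rank}. {hit['title']}: {hit['description']}")
        if rank <= top_n:
            body = strip_frontmatter(hit["text"])[:body_chars]
            # Indented bar block so the excerpt is visually distinct from headers.
            out.extend(f"   │ {line}" for line in body.splitlines())
    return "\n".join(out)
```

Hits ranked 4 and later fall through the rank check and stay header-only, which is what keeps the response size bounded.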
v1 — Embed past-mistake body in anti-anchor alerts. Compass's drift detector matches the current prompt against 35 negative anchors learned from prior mistakes. Previously, alerts just showed the anchor label. v1.1.0 embeds body from the most-relevant past lesson session using a two-tier match: substring 6-gram against the anchor + lesson-type frontmatter, falling back to recent drift!=green sessions.
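One plausible reading of the two-tier match is sketched below. The release post does not spell out the exact rule, so everything here is an assumption: sessions are modelled as dicts with hypothetical keys type, body, drift and ended_at, tier 1 counts shared 6-character substrings between the anchor and sessions whose frontmatter type is "lesson", and tier 2 falls back to the most recent session whose drift is not green.

```python
def char_ngrams(text, n=6):
    """All lowercase character n-grams of text, as a set."""
    text = text.lower()
    return {text[i:i + n] for i in range(len(text) - n + 1)}


def pick_lesson(anchor, sessions, n=6):
    """Pick the past session whose body should be embedded in the alert.

    Hypothetical session schema: {'type', 'body', 'drift', 'ended_at'}.
    """
    grams = char_ngrams(anchor, n)

    # Tier 1: lesson-type sessions sharing at least one 6-gram with the anchor.
    best, best_overlap = None, 0
    for session in sessions:
        if session.get("type") != "lesson":
            continue
        body = session["body"].lower()
        overlap = sum(1 for g in grams if g in body)
        if overlap > best_overlap:
            best, best_overlap = session, overlap
    if best is not None:
        return best

    # Tier 2: most recent session that drifted (drift is anything but green).
    drifted = [s for s in sessions if s.get("drift") != "green"]
    return max(drifted, key=lambda s: s["ended_at"], default=None)
```

The sketch assumes ended_at is an ISO-8601 string, so plain string comparison orders sessions by time. It returns None when neither tier finds a candidate, in which case an alert would presumably fall back to showing only the anchor label.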
v2: Detect 'recall fired but not consumed'. A new module, recall_consumption.py, walks back through the live session jsonl file, finds recent recall blocks, extracts memory file paths, then checks subsequent assistant turns for matching Read tool calls. If recall surfaced N paths and 0 got read, that is the failure signature. The check is wired into the drift_check MCP tool result and into a mid_session_hook that runs every 25 tool calls and only nags when at least 3 paths are unconsumed and the consumption ratio is below 0.3. Tested on a 130MB / 32k-line session: 41 recall hits surfaced, 0 consumed.
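The consumption check and the nag threshold can be illustrated with a simplified sketch. The record schema below is invented for illustration and is not the real Claude Code session format or the actual recall_consumption.py: recall records are assumed to look like {"type": "recall", "paths": [...]} and assistant turns like {"type": "assistant", "tool_calls": [{"name": "Read", "input": {"file_path": ...}}]}. The sketch also scans forward through the log rather than walking back from the end.

```python
import json


def consumption_report(jsonl_lines):
    """Count recall-surfaced paths and how many were later opened via Read."""
    surfaced = []     # paths in the order recall first surfaced them
    consumed = set()  # surfaced paths that a later Read call opened
    for raw in jsonl_lines:
        raw = raw.strip()
        if not raw:
            continue
        record = json.loads(raw)
        if record.get("type") == "recall":
            for path in record.get("paths", []):
                if path not in surfaced:
                    surfaced.append(path)
        elif record.get("type") == "assistant":
            for call in record.get("tool_calls", []):
                if call.get("name") != "Read":
                    continue
                path = call.get("input", {}).get("file_path")
                # Only count reads that happen after the path was surfaced.
                if path in surfaced:
                    consumed.add(path)
    unconsumed = [p for p in surfaced if p not in consumed]
    ratio = len(consumed) / len(surfaced) if surfaced else 1.0
    return {"surfaced": len(surfaced), "consumed": len(consumed),
            "unconsumed": unconsumed, "ratio": ratio}


def should_nag(report, min_unconsumed=3, max_ratio=0.3):
    """Nag only when >= 3 paths are unconsumed AND the ratio is below 0.3."""
    return len(report["unconsumed"]) >= min_unconsumed and report["ratio"] < max_ratio
```

Under these assumptions, the session described above (41 surfaced, 0 consumed) has a ratio of 0.0 and trips the nag, while a session that read 2 of 4 surfaced files does not.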
Governance Plan Scales Without Templates
v1.1.0 also ships a new governance_plan MCP tool that reads two file-exported registries: agents_capabilities.json (what each executor declares it can do) and anchor_packs_phases.json (per-domain DAG of phases). For each phase, V7 ranks executors by capability score (+10 capability match, +5 domain match, +3 anchor pack match), picks the highest, and emits a queue file with depends_on_phase_ids. The tool was verified on the marketing/dev-tools and caishen-finance/audit domains. Adding a new domain requires one row in each registry and no change to V7 source.
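The scoring rule is simple enough to sketch. The +10 / +5 / +3 weights and the depends_on_phase_ids field come from the release post; the registry schemas, the other field names, and the function names score and build_plan are hypothetical stand-ins for whatever agents_capabilities.json and anchor_packs_phases.json actually contain.

```python
def score(executor, phase, domain):
    """Capability score: +10 capability, +5 domain, +3 anchor pack match."""
    total = 0
    if phase["capability"] in executor.get("capabilities", []):
        total += 10
    if domain in executor.get("domains", []):
        total += 5
    if phase.get("anchor_pack") in executor.get("anchor_packs", []):
        total += 3
    return total


def build_plan(executors, phases, domain):
    """Assign the highest-scoring executor to each phase of a domain's DAG."""
    queue = []
    for phase in phases:
        # max() keeps the first executor on ties, so registry order breaks ties.
        best = max(executors, key=lambda e: score(e, phase, domain))
        queue.append({
            "phase_id": phase["id"],
            "executor": best["name"],
            "score": score(best, phase, domain),
            "depends_on_phase_ids": phase.get("depends_on", []),
        })
    return queue
```

Because the ranking reads only registry data, adding a domain in this sketch means adding entries to the two input lists, which mirrors the post's claim that no source change is needed.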
Eval Numbers Unchanged
Eval numbers remain locked from 2026-05-08: LongMemEval-S 56.6% (n=500), EverMemBench-Dynamic Run 1 44.4% (n=500), Run 2 47.3% (n=497), drift detector ROC AUC 0.83, reproduction cost $3.50 end-to-end. v1.1.0 doesn't move the eval numbers — it moves the consumption numbers, the ratio of recall hits whose body actually lands in the agent's working context.
What to watch
Watch for a clean benchmark on recall consumption ratio — the project explicitly solicits suggestions. If adoption grows, look for similar consumption-audit features in competing agent memory systems (Zep, MemOS) within 90 days.
Source: dev.to









