A 650-trial activation experiment found Claude skills with directive descriptions hit 100% activation, while passive phrasing collapsed to 37%. The finding, shared by developer @akshay_pachaar, reveals that the YAML description field does 90% of the work in skill reliability.
Key facts
- Directive descriptions achieved 100% activation in 650 trials
- Passive "Use when..." phrasing collapsed to 37% under load
- 500-line cap is a maximum, not a target
- Past 200 lines, instructions at bottom get ignored
- 30-60 lines recommended for creative skills
Why Skills Fail Silently
Claude skills load through progressive disclosure, a 3-tier system that keeps context lean. Tier 1 (metadata, ~100 tokens per skill) is always loaded in the system prompt. Tier 2 (SKILL.md body) enters context only when Claude decides the skill is relevant based on the description alone. Tier 3 (bundled resources) loads on demand — scripts run via bash with only stdout entering context, reference files load when branched to, and assets never enter context at all [According to @akshay_pachaar].
This architecture means selection failure is the dominant failure mode. If the description doesn't trigger correctly, the body might as well not exist.
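To make the tiers concrete, here is a minimal sketch of a single skill file. The skill name, body text, and file paths are illustrative assumptions, not Anthropic's shipped examples; only the name and description frontmatter fields follow the documented skill format.

```markdown
---
# Tier 1: this frontmatter (~100 tokens) is always in the system prompt
name: pdf-form-filler
description: PDF form expert. ALWAYS invoke this skill when the user asks
  about filling PDF forms. Do not fill forms directly, use this skill first.
---
<!-- Tier 2: the body below enters context only if the description triggers -->
Fill forms using the field mappings in references/field-map.md.
<!-- Tier 3: references/ files and scripts/ load only on demand -->
```

Nothing below the second `---` exists for Claude until the description line wins the selection decision.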
The 650-Trial Activation Experiment
A controlled experiment with 650 trials compared two description styles. Directive descriptions, built on the template "[domain] expert. ALWAYS invoke this skill when the user asks about [topic]. Do not [do the task] directly, use this skill first", achieved 100% activation. Passive phrasing starting with "Use when..." collapsed to 37% under load [According to @akshay_pachaar].
The template combines positive routing with a negative constraint. The experiment found that either alone is insufficient because Claude finds the bypass.
Five Practices That Compound
1. Write pushy descriptions. The template above is the recommended format. Anthropic's own docx skill ends with an explicit exclusion clause: "Do NOT use for PDFs, spreadsheets, Google Docs, or general coding tasks unrelated to document generation." This stops two skills from fighting over the same vocabulary.
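Putting the directive template and the exclusion clause together, a description for a hypothetical spreadsheet skill might read as follows (the wording is an assumption, patterned on the template and Anthropic's docx example):

```yaml
description: >-
  Spreadsheet generation expert. ALWAYS invoke this skill when the user
  asks about creating .xlsx files, formulas, or charts. Do not build
  spreadsheets directly, use this skill first. Do NOT use for PDFs, Word
  documents, or general data-analysis questions.
```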
2. Keep SKILL.md under 200 lines. The 500-line cap is a maximum, not a target. Past 200 lines, instructions at the bottom get ignored. Move detail to a references/ folder and link to it. For pure judgment skills like creative writing or design, aim for 30-60 lines — more constraints destroy the output.
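A layout consistent with that rule might look like this (directory and file names are illustrative):

```
excel-builder/
├── SKILL.md                  # under 200 lines: triggers and core workflow only
├── references/
│   └── formula-cookbook.md   # linked from SKILL.md, loaded only when needed
└── scripts/
    └── verify_output.py      # run via bash; only stdout enters context
```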
3. Pair every generation skill with a verifier. Activation is one reliability problem; execution adherence is the harder, silent one. The model rationally skips verification steps because they delay output and add no visible content. Ship a paired verifier with binary, mechanical assertions: regex, parsers, file existence checks. Use LLM-as-judge only for genuinely subjective dimensions.
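A minimal verifier sketch, assuming the generation skill writes a hypothetical output/report.json (the path, schema, and ID format are invented for illustration). Every check is binary and mechanical:

```python
#!/usr/bin/env python3
"""Paired verifier for a hypothetical report-generation skill.

Every assertion is binary and mechanical; the exit code is the only verdict.
"""
import json
import re
import sys
from pathlib import Path

REPORT = Path("output/report.json")  # hypothetical artifact the skill must produce


def main() -> int:
    # File-existence check: the cheapest binary assertion.
    if not REPORT.exists():
        print(f"FAIL: {REPORT} was not created")
        return 1

    # Parser check: the file is valid JSON or the verifier fails, no judgment involved.
    try:
        data = json.loads(REPORT.read_text())
    except json.JSONDecodeError as exc:
        print(f"FAIL: invalid JSON: {exc}")
        return 1

    # Regex check: report_id must match an assumed format like 'RPT-2024-0001'.
    if not re.fullmatch(r"RPT-\d{4}-\d{4}", str(data.get("report_id", ""))):
        print("FAIL: report_id missing or malformed")
        return 1

    print("PASS: all mechanical checks passed")
    return 0


if __name__ == "__main__":
    sys.exit(main())
```

Because the script communicates only through stdout and its exit code, the model (or a CI hook) gets an unambiguous pass/fail signal with no room to rationalize skipping verification.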
4. Iterate on the right layer. Diagnose by symptom:
- Skill never fires → change the description.
- Fires on the wrong prompts → add exclusions to the description.
- Fires but produces the wrong format → change the body.
- Same helper code generated every run → promote it to scripts/.
Teams often keep tweaking the body when selection is the actual problem.
5. Build a regression fixture per skill. Three saved prompts, expected behaviors, re-run on every model update. Model behavior shifts silently, and without a fixture you discover regressions through user complaints.
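A fixture can be a handful of lines. In this sketch, run_claude is a hypothetical harness standing in for however you invoke the model and detect whether the skill activated:

```python
"""Per-skill regression fixture: three saved prompts with expected behaviors."""

FIXTURES = [
    # (prompt, skill should fire?, substring expected in the output)
    ("Turn this CSV into an Excel file with a chart", True, ".xlsx"),
    ("Summarize this PDF for me", False, None),  # the exclusion clause must hold
    ("Build a spreadsheet tracking the Q3 budget", True, ".xlsx"),
]


def run_regression(run_claude):
    """run_claude(prompt) -> (skill_fired: bool, output: str), supplied by your harness."""
    failures = []
    for prompt, should_fire, expected in FIXTURES:
        fired, output = run_claude(prompt)
        if fired != should_fire:
            failures.append(f"activation drift on {prompt!r}: fired={fired}")
        elif expected is not None and expected not in output:
            failures.append(f"format drift on {prompt!r}: missing {expected!r}")
    return failures
```

Wire this into CI or a pre-upgrade checklist so activation drift surfaces before users report it.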
The Deepest Insight
Maturation in this craft involves removing something, not adding it. Compress descriptions, sharpen triggers, trust the model with enough latitude to apply judgment correctly. Skills are Markdown with a tiny bit of YAML and some optional scripts. The leverage is in what you leave out [According to @akshay_pachaar].
What to watch
Watch for Anthropic to publish official skill authoring guidelines or a CLI linter that enforces description formatting. Also track whether Claude model updates silently change activation behavior — regression fixtures become essential with each major release.