What Changed — Skills Are Now Testable, Not Guesswork
If you've ever published a Claude Code skill, you've felt the anxiety: Will this work for other people? Will the next model update silently break it? The answer was always "I don't know" — until now.
Caliper (github.com/edonadei/caliper) is a new open-source harness that runs your skill k times in isolated environments and gives you a pass@k score. You define success in a YAML spec using an LLM judge, a Python assertion, or both. Then you run:
caliper run extract-actions.eval.yaml --k 5 --baseline
And see:
ID      Task                            k(5)   pass@k
task-1  Extracts action items as JSON   5/5    100% PASS

With skill   100%
No skill      60%
Delta        +40%
The --baseline flag is the killer feature: it re-runs everything without your skill, so you see the delta. A delta of +40% means your skill is actually helping. A delta of 0% means it's doing nothing, and a negative delta means it's actively harming results.
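The delta arithmetic can be sketched in a few lines of Python. This is an illustration only: it assumes each score is simply the fraction of passing runs, the function names and per-run outcomes are hypothetical, and Caliper's own scoring may differ.

```python
def pass_rate(results: list[bool]) -> float:
    """Fraction of runs that passed."""
    if not results:
        raise ValueError("need at least one run")
    return sum(results) / len(results)


def delta(with_skill: list[bool], baseline: list[bool]) -> float:
    """Pass rate with the skill minus pass rate without it."""
    return pass_rate(with_skill) - pass_rate(baseline)


# Hypothetical per-run outcomes chosen to match the scores above (k = 5).
with_skill = [True, True, True, True, True]
no_skill = [True, False, True, True, False]

print(f"With skill {pass_rate(with_skill):.0%}")
print(f"No skill   {pass_rate(no_skill):.0%}")
print(f"Delta      {delta(with_skill, no_skill):+.0%}")
```

A positive result means the skill passed more often than the baseline did over the same tasks.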
What It Means For You — Stop Shipping Untested Skills
Many Claude Code skills are tested once, look good, and then break silently when a new model is released. Caliper addresses this by making evaluation repeatable and measurable rather than a one-off check.
Here's what you can do right now:
Install Caliper via a Claude Code skill:
npx skills@latest add edonadei/caliper
This installs two skills: evaluate-skill (run and manage evals) and grill-skill (reads your SKILL.md, interviews you, and writes a 3-task spec).
Write your first eval spec in a .eval.yaml file:
tasks:
  - name: Extracts action items as clean JSON
    prompt: "Read /tmp/transcript.txt and write the action items to /tmp/actions.json."
    expect: "A valid JSON array where every item has owner, task, due. No markdown fences."
    assert: |
      import json
      items = json.load(open("/tmp/actions.json"))
      assert isinstance(items, list)
      assert all({"owner","task","due"} <= i.keys() for i in items)
Run it with --k 5 and --baseline to see your skill's true performance.
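Before putting an assertion into a spec, it can help to run the same check against a hand-written sample. The sketch below wraps the assertion from the spec above in a function and exercises it on a temporary file. `check_actions` and the sample data are hypothetical names used for illustration; they are not part of Caliper.

```python
import json
import tempfile
from pathlib import Path

REQUIRED_KEYS = {"owner", "task", "due"}


def check_actions(path: str) -> None:
    """Assert that the file holds a JSON list whose items all have the required keys."""
    with open(path, encoding="utf-8") as f:
        items = json.load(f)
    assert isinstance(items, list), "top-level value must be a JSON array"
    for item in items:
        assert isinstance(item, dict), "each item must be a JSON object"
        assert REQUIRED_KEYS <= item.keys(), f"missing keys in {item!r}"


# Hand-written sample data, used only to exercise the check.
sample = [{"owner": "Ana", "task": "Send recap", "due": "2025-01-15"}]
with tempfile.TemporaryDirectory() as tmp:
    target = Path(tmp) / "actions.json"
    target.write_text(json.dumps(sample), encoding="utf-8")
    check_actions(str(target))
    print("sample passes")
```

Checking each item's type first gives a readable failure message if the model writes a list of strings instead of objects.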
Try It Now — Your First Caliper Run
# Install Caliper
pip install caliper-eval

# Or add it as a Claude Code skill
npx skills@latest add edonadei/caliper

# Create a simple eval
cat > my-skill.eval.yaml << 'EOF'
tasks:
  - name: Generates valid Python
    prompt: "Write a function named fibonacci that returns the nth Fibonacci number and save it to /tmp/fib.py"
    expect: "A valid Python file with a function that returns correct Fibonacci numbers"
    assert: |
      import sys
      sys.path.insert(0, "/tmp")
      from fib import fibonacci
      assert fibonacci(0) == 0
      assert fibonacci(1) == 1
      assert fibonacci(10) == 55
EOF

# Run it 10 times with baseline
caliper run my-skill.eval.yaml --k 10 --baseline
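For reference, a file that would satisfy the assertion in this spec could look like the sketch below. It assumes the 0-indexed convention the assertions imply (fibonacci(0) == 0). The iterative approach is one choice among several, and what the model actually writes will vary from run to run.

```python
def fibonacci(n: int) -> int:
    """Return the nth Fibonacci number, with fibonacci(0) == 0."""
    if n < 0:
        raise ValueError("n must be non-negative")
    a, b = 0, 1
    for _ in range(n):
        a, b = b, a + b
    return a


# The same checks the spec's assert block runs.
assert fibonacci(0) == 0
assert fibonacci(1) == 1
assert fibonacci(10) == 55
```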
Caliper supports multiple backends: you can run the skill on one model and judge with another. This is especially useful if you want to test a Claude Code skill but use a cheaper model (like GPT-4o-mini) for judging.
The Bottom Line
Testing agentic code is fundamentally different from testing deterministic code. A skill that works once might fail 40% of the time. Caliper gives you the data to know for sure — and the --baseline flag tells you if your skill is actually adding value or just getting in the way.
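To see why a single passing run says little, consider a skill whose true per-run success rate is 60%. That figure is assumed for illustration, and the runs are treated as independent. The sketch below computes how often k runs would all pass and how often at least one would fail.

```python
def all_pass_probability(p: float, k: int) -> float:
    """Probability that k independent runs all pass, given per-run pass rate p."""
    return p ** k


p = 0.6  # assumed per-run success rate, i.e. the skill fails 40% of the time
for k in (1, 5, 10):
    all_pass = all_pass_probability(p, k)
    print(f"k={k:>2}: all pass {all_pass:.1%}, at least one failure {1 - all_pass:.1%}")
```

Under these assumptions a single run passes 60% of the time, while five runs all pass only about 8% of the time, so repeated runs expose flakiness that one run would usually hide.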

Source: apimatic.io








