gentic.news — AI News Intelligence Platform

Stop Testing Skills Once: Use Caliper's pass@k to Measure What Actually

Caliper is a lightweight harness that runs Claude Code skills k times, scores them with pass@k, and compares against a no-skill baseline so you know if your skill actually helps.

20h ago · 3 min read · AI-Generated
Source: apimatic.io via hn_claude_code, medium_agentic, github_changelog, devto_claudecode, gn_claude_code, gn_claude_model, gn_claude_community (Corroborated)
How do I reliably test my Claude Code skills to see if they actually improve results?

Install Caliper with `npx skills@latest add edonadei/caliper`, write a YAML eval spec with `prompt` and `expect`, then run `caliper run my-eval.eval.yaml --k 5 --baseline` to see your skill's true pass@k rate and the delta over the base agent.

TL;DR

Run your Claude Code skills k times with an LLM judge or Python assertion to get a real pass@k score and delta vs. no skill.

What Changed — Skills Are Now Testable, Not Guesswork

If you've ever published a Claude Code skill, you've felt the anxiety: Will this work for other people? Will the next model update silently break it? The answer was always "I don't know" — until now.

Caliper (github.com/edonadei/caliper) is a new open-source harness that runs your skill k times in isolated environments and gives you a pass@k score. You define success in a YAML spec with either an LLM judge, a Python assertion, or both. Then you run:

caliper run extract-actions.eval.yaml --k 5 --baseline

And see:

ID      Task                           k(5)  pass@k
task-1  Extracts action items as JSON  5/5   100%  PASS
With skill   100%
No skill      60%
Delta        +40%

The --baseline flag is the killer feature: it re-runs everything without your skill, so you see the delta. A delta of +40 percentage points, as in the output above, means your skill is actually helping. A delta of zero means it is doing nothing, and a negative delta means it is actively harming results.
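To make the summary numbers concrete, here is a minimal Python sketch of the arithmetic. It assumes the percentages Caliper prints are simply the fraction of the k runs that passed, which matches the 5/5 = 100% row but is not confirmed by the source. The function names and the per-run baseline results are hypothetical. The second function is the standard unbiased pass@k estimator from the code-generation literature, included only for comparison; Caliper may or may not use it.

```python
from math import comb


def pass_rate(results):
    """Fraction of attempts that passed (assumed meaning of the printed %)."""
    return sum(results) / len(results)


def pass_at_k(n, c, k):
    """Standard unbiased pass@k estimator: the chance that at least one of
    k samples, drawn from n attempts of which c passed, is a pass."""
    if n - c < k:
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)


# Hypothetical per-run outcomes consistent with the output shown above.
with_skill = [True, True, True, True, True]    # 5/5
no_skill = [True, True, True, False, False]    # 3/5

delta = pass_rate(with_skill) - pass_rate(no_skill)
print(f"With skill {pass_rate(with_skill):.0%}")
print(f"No skill   {pass_rate(no_skill):.0%}")
print(f"Delta      {delta:+.0%}")
```

Note that the delta is a difference in percentage points, not a relative improvement.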

What It Means For You — Stop Shipping Untested Skills

Many Claude Code skills are tested once, look good, and then break silently when a new model is released. Caliper addresses this by making evaluation repeatable: the same spec runs k times per task, so one lucky run cannot hide an unreliable skill.

Here's what you can do right now:

  1. Install Caliper via a Claude Code skill:

    npx skills@latest add edonadei/caliper
    

    This installs two skills: evaluate-skill (run and manage evals) and grill-skill (reads your SKILL.md, interviews you, and writes a 3-task spec).

  2. Write your first eval spec in a .eval.yaml file:

    tasks:
      - name: Extracts action items as clean JSON
        prompt: "Read /tmp/transcript.txt and write the action items to /tmp/actions.json."
        expect: "A valid JSON array where every item has owner, task, due. No markdown fences."
        assert: |
          import json
          items = json.load(open("/tmp/actions.json"))
          assert isinstance(items, list)
          assert all({"owner","task","due"} <= i.keys() for i in items)
    
  3. Run it with --k 5 and --baseline to see your skill's true performance.
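Before wiring an assertion into a spec, it can help to dry-run the same check against a hand-written sample file. The sketch below is one way to do that, assuming nothing about Caliper itself: `check_actions`, `REQUIRED_KEYS`, and the sample data are hypothetical names for illustration, and the logic mirrors the `assert:` block from step 2 with an added error message per failure.

```python
import json
import tempfile
from pathlib import Path

REQUIRED_KEYS = {"owner", "task", "due"}


def check_actions(path):
    """Raise AssertionError unless the file holds a JSON array of
    objects that each carry every required key. Returns the item count."""
    with open(path, encoding="utf-8") as fh:
        items = json.load(fh)
    assert isinstance(items, list), "top-level value must be a JSON array"
    for item in items:
        assert isinstance(item, dict), "each item must be a JSON object"
        missing = REQUIRED_KEYS - set(item)
        assert not missing, f"item is missing keys: {sorted(missing)}"
    return len(items)


# Dry run against a hand-written sample (illustrative data only).
sample = [{"owner": "Dana", "task": "Send recap", "due": "2025-01-15"}]
with tempfile.TemporaryDirectory() as tmp:
    sample_path = Path(tmp) / "actions.json"
    sample_path.write_text(json.dumps(sample), encoding="utf-8")
    checked = check_actions(sample_path)

print(f"{checked} item(s) passed the check")
```

If the check passes on a known-good sample and fails on a deliberately broken one, the assertion is probably testing what you think it is.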

Try It Now — Your First Caliper Run

# Install Caliper
pip install caliper-eval

# Or add it as a Claude Code skill
npx skills@latest add edonadei/caliper

# Create a simple eval
cat > my-skill.eval.yaml << 'EOF'
tasks:
  - name: Generates valid Python
    prompt: "Write a function that returns the nth Fibonacci number to /tmp/fib.py"
    expect: "A valid Python file with a function that returns correct Fibonacci numbers"
    assert: |
      import sys
      sys.path.insert(0, "/tmp")
      from fib import fibonacci
      assert fibonacci(0) == 0
      assert fibonacci(1) == 1
      assert fibonacci(10) == 55
EOF

# Run it 10 times with baseline
caliper run my-skill.eval.yaml --k 10 --baseline

Caliper supports multiple backends: you can run the skill on one model and judge with another. This is especially useful if you want to test a Claude Code skill but use a cheaper model (like GPT-4o-mini) for judging.

The Bottom Line

Testing agentic code is fundamentally different from testing deterministic code. A skill that works once might fail 40% of the time. Caliper gives you the data to know for sure — and the --baseline flag tells you if your skill is actually adding value or just getting in the way.
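A quick back-of-the-envelope calculation shows why a single test is weak evidence. The sketch below assumes each run is independent and that the skill has a fixed true pass rate, which is a simplification; `chance_all_pass` is a hypothetical helper, not part of Caliper.

```python
def chance_all_pass(true_rate, runs):
    """Probability that every one of `runs` independent attempts passes,
    given a fixed per-attempt pass rate."""
    return true_rate ** runs


# A skill that fails 40% of the time has a 60% true pass rate.
single = chance_all_pass(0.6, 1)
five = chance_all_pass(0.6, 5)

print(f"Looks fine after 1 run:  {single:.0%}")
print(f"Looks fine after 5 runs: {five:.1%}")
```

Under these assumptions, a flaky skill passes a one-off check more often than not, but rarely survives five runs in a row.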


AI-assisted reporting. Generated by gentic.news from multiple verified sources, fact-checked against the Living Graph of 4,300+ entities. Edited by Ala SMITH.

AI Analysis

**What Claude Code users should do differently:**

1. **Stop shipping skills after one test.** Install Caliper today and run every skill through a `--k 5` eval with `--baseline`. If the delta is ≤0%, your skill is noise: delete it or iterate.

2. **Use `grill-skill` for new skills.** Before writing a skill, run `grill-skill` which will read your SKILL.md, interview you about what "good" looks like, and auto-generate a 3-task eval spec (happy path, edge case, adversarial). This forces you to define success before you start coding.

3. **Add eval specs to your CLAUDE.md.** Reference your eval files so future sessions can run them automatically. Example:

   ```markdown
   ## Testing
   - Run `caliper run evals/ --k 3 --baseline` before shipping any skill change
   - All skills must show ≥80% pass@k with ≥+20% delta vs. baseline
   ```

4. **Run evals after every model update.** When Claude Code updates its model, re-run your eval suite. If pass@k drops, you'll know immediately instead of hearing about it from users.
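The shipping rule suggested above (at least 80% pass@k and at least +20 points over baseline) is easy to encode as a check. This is a sketch only: `meets_shipping_bar` is a hypothetical helper, the percentages are typed in by hand, and nothing here parses Caliper's output, since the source does not describe a machine-readable format.

```python
def meets_shipping_bar(with_skill_pct, baseline_pct, min_pass=80, min_delta=20):
    """Return True when the skill clears both thresholds.

    Percentages are whole numbers (0-100) to avoid float rounding.
    """
    delta = with_skill_pct - baseline_pct
    return with_skill_pct >= min_pass and delta >= min_delta


# Numbers from the example run earlier in the article: 100% vs. 60%.
print(meets_shipping_bar(100, 60))

# High pass rate, but the baseline is nearly as good: the skill adds little.
print(meets_shipping_bar(80, 70))
```

Checking both conditions matters: a high pass rate alone can mean the base agent already handles the task without the skill.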