Stop Testing Skills Once: Use Caliper's pass@k to Measure What Actually

Caliper is a lightweight harness that runs Claude Code skills k times, scores them with pass@k, and compares against a no-skill baseline so you know if your skill actually helps.

Ala SMITH & AI Research Desk · 20h ago · 3 min read · AI-Generated

Source: apimatic.io via hn_claude_code, medium_agentic, github_changelog, devto_claudecode, gn_claude_code, gn_claude_model, gn_claude_community (Corroborated)

How do I reliably test my Claude Code skills to see if they actually improve results?

Install Caliper with `npx skills@latest add edonadei/caliper`, write a YAML eval spec with `prompt` and `expect`, then run `caliper run my-eval.eval.yaml --k 5 --baseline` to see your skill's true pass@k rate and the delta over the base agent.

TL;DR

Run your Claude Code skills k times with an LLM judge or Python assertion to get a real pass@k score and delta vs. no skill.

What Changed — Skills Are Now Testable, Not Guesswork

If you've ever published a Claude Code skill, you've felt the anxiety: Will this work for other people? Will the next model update silently break it? The answer was always "I don't know" — until now.

Caliper (github.com/edonadei/caliper) is a new open-source harness that runs your skill k times in isolated environments and gives you a pass@k score. You define success in a YAML spec with either an LLM judge, a Python assertion, or both. Then you run:

caliper run extract-actions.eval.yaml --k 5 --baseline

And see:

ID      Task                           k(5)  pass@k
task-1  Extracts action items as JSON  5/5   100%  PASS
With skill   100%
No skill      60%
Delta        +40%

The --baseline flag is the killer feature: it re-runs everything without your skill, so you see the delta in percentage points. A +40% delta means your skill is actually helping. A delta of 0% means it's doing nothing, and a negative delta means it's actively harming results.
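To make the arithmetic behind that table concrete, here is a minimal Python sketch. It assumes per-run results are available as lists of booleans (`with_skill` and `no_skill` are made-up data chosen to match the sample output), and it treats the reported percentage as a plain pass fraction, which is consistent with the 60% baseline shown but is not confirmed by the source. `pass_at_k` is the unbiased estimator commonly used for code-generation benchmarks, included for comparison; whether Caliper uses it is an assumption.

```python
from math import comb


def pass_fraction(results):
    """Share of runs that passed, e.g. 3 passes in 5 runs -> 0.6."""
    return sum(results) / len(results)


def pass_at_k(n, c, k):
    """Estimate P(at least one of k samples passes) from c passes in n runs."""
    if n - c < k:
        return 1.0  # too few failures to fill k samples with failures only
    return 1.0 - comb(n - c, k) / comb(n, k)


# Illustrative data only: 5/5 with the skill, 3/5 without it.
with_skill = [True, True, True, True, True]
no_skill = [True, False, True, False, True]

delta = pass_fraction(with_skill) - pass_fraction(no_skill)
print(f"With skill {pass_fraction(with_skill):.0%}")
print(f"No skill   {pass_fraction(no_skill):.0%}")
print(f"Delta      {delta * 100:+.0f} points")
```

The two definitions diverge quickly: with 3 passes in 5 runs the pass fraction is 60%, while the "at least one of 5" estimate is already 100%. It is worth checking which one a tool reports before comparing numbers across tools.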

What It Means For You — Stop Shipping Untested Skills

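Most Claude Code skills are tested once, look good, and then break silently when a new model releases. Agent output is not deterministic, so a single passing run proves little. Caliper addresses this by making evaluation repeatable: each task runs k times, so a flaky skill shows up as a pass rate below 100% instead of one lucky success.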

Here's what you can do right now:

Install Caliper via a Claude Code skill:
```
npx skills@latest add edonadei/caliper
```
This installs two skills: evaluate-skill (run and manage evals) and grill-skill (reads your SKILL.md, interviews you, and writes a 3-task spec).

Write your first eval spec in a .eval.yaml file:

tasks:
  - name: Extracts action items as clean JSON
    prompt: "Read /tmp/transcript.txt and write the action items to /tmp/actions.json."
    expect: "A valid JSON array where every item has owner, task, due. No markdown fences."
    assert: |
      import json
      items = json.load(open("/tmp/actions.json"))
      assert isinstance(items, list)
      assert all({"owner","task","due"} <= i.keys() for i in items)

Run it with --k 5 and --baseline to see your skill's true performance.
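Before spending k agent runs on a spec, it can help to check that the assertion itself behaves as intended. The sketch below lifts the spec's checks into a standalone function and runs them against hand-written sample files. The `check_actions` name, the sample data, and the temporary paths are all hypothetical, and how Caliper actually executes the `assert` block is not described in the source.

```python
import json
import tempfile
from pathlib import Path

REQUIRED_KEYS = {"owner", "task", "due"}


def check_actions(path):
    """Apply the spec's checks to one output file and return the parsed items."""
    with open(path) as fh:
        items = json.load(fh)  # raises on markdown fences or any non-JSON text
    assert isinstance(items, list), "top level must be a JSON array"
    assert all(
        isinstance(i, dict) and REQUIRED_KEYS <= i.keys() for i in items
    ), "every item needs owner, task, due"
    return items


# Hypothetical sample output, used only to exercise the checks.
sample = [{"owner": "Dana", "task": "Send notes", "due": "2025-01-10"}]
fence = "`" * 3  # a markdown fence, built indirectly to keep this listing clean

with tempfile.TemporaryDirectory() as tmp:
    good = Path(tmp) / "actions.json"
    good.write_text(json.dumps(sample))
    print(len(check_actions(good)), "item(s) passed")

    bad = Path(tmp) / "fenced.json"
    bad.write_text(f"{fence}json\n{json.dumps(sample)}\n{fence}")
    try:
        check_actions(bad)
    except json.JSONDecodeError:
        print("fenced output rejected")
```

Because `json.load` fails on fenced output, the Python assertion already covers the "no markdown fences" part of `expect`, leaving the LLM judge to assess the parts a script cannot.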

Try It Now — Your First Caliper Run

# Install Caliper
pip install caliper-eval

# Or add it as a Claude Code skill
npx skills@latest add edonadei/caliper

# Create a simple eval
cat > my-skill.eval.yaml << 'EOF'
tasks:
  - name: Generates valid Python
    prompt: "Write a function named fibonacci that returns the nth Fibonacci number (fibonacci(0) == 0) and save it to /tmp/fib.py"
    expect: "A valid Python file with a function that returns correct Fibonacci numbers"
    assert: |
      import sys
      sys.path.insert(0, "/tmp")
      from fib import fibonacci
      assert fibonacci(0) == 0
      assert fibonacci(1) == 1
      assert fibonacci(10) == 55
EOF

# Run it 10 times with baseline
caliper run my-skill.eval.yaml --k 10 --baseline
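For reference, this is roughly what a passing /tmp/fib.py could look like. It is a hand-written sketch, not output from Claude Code or Caliper. Note that the assertion imports a function named `fibonacci` and expects `fibonacci(0) == 0`, so an otherwise correct solution with a different name or indexing would be scored as a failure.

```python
def fibonacci(n: int) -> int:
    """Return the nth Fibonacci number, counting from fibonacci(0) == 0."""
    if n < 0:
        raise ValueError("n must be non-negative")
    a, b = 0, 1
    for _ in range(n):
        a, b = b, a + b  # advance one step along the sequence
    return a


# The same checks the eval spec applies.
assert fibonacci(0) == 0
assert fibonacci(1) == 1
assert fibonacci(10) == 55
```

When an assertion depends on a detail like a function name, stating that detail in the prompt keeps the eval measuring the skill rather than the agent's naming choices.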

Caliper supports multiple backends: you can run the skill on one model and judge with another. This is especially useful if you want to test a Claude Code skill but use a cheaper model (like GPT-4o-mini) for judging.

The Bottom Line

Testing agentic code is fundamentally different from testing deterministic code. A skill that works once might fail 40% of the time. Caliper gives you the data to know for sure — and the --baseline flag tells you if your skill is actually adding value or just getting in the way.
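A quick calculation shows why one green run proves so little. The sketch below assumes, purely for illustration, a skill whose true pass rate is 60% and runs that are independent of each other. Neither figure comes from Caliper.

```python
def clean_sweep(pass_rate: float, k: int) -> float:
    """Probability that all k independent runs pass."""
    return pass_rate ** k


true_pass_rate = 0.6  # assumed for illustration: the skill fails 40% of the time

for k in (1, 3, 5, 10):
    print(f"k={k:>2}: chance every run passes = {clean_sweep(true_pass_rate, k):.1%}")
```

Under these assumptions a single test passes 60% of the time, so a flaky skill usually looks fine when tested once, while a clean sweep at k=5 happens only about 8% of the time.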


Source: gentic.news · 20h ago · author=Ala SMITH · citation.json

AI-assisted reporting. Generated by gentic.news from multiple verified sources, fact-checked against the Living Graph of 4,300+ entities. Edited by Ala SMITH.


AI Analysis

**What Claude Code users should do differently:**

1. **Stop shipping skills after one test.** Install Caliper today and run every skill through a `--k 5` eval with `--baseline`. If the delta is ≤0%, your skill is noise: delete it or iterate.
2. **Use `grill-skill` for new skills.** Before shipping a skill, run `grill-skill`, which reads your SKILL.md, interviews you about what "good" looks like, and auto-generates a 3-task eval spec (happy path, edge case, adversarial). This forces you to define success up front.
3. **Add eval specs to your CLAUDE.md.** Reference your eval files so future sessions can run them automatically. Example:

   ```markdown
   ## Testing
   - Run `caliper run evals/ --k 3 --baseline` before shipping any skill change
   - All skills must show ≥80% pass@k with ≥+20% delta vs. baseline
   ```
4. **Run evals after every model update.** When Claude Code updates its model, re-run your eval suite. If pass@k drops, you'll know immediately instead of hearing about it from users.
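The thresholds in the CLAUDE.md example (at least 80% pass@k and at least +20 points over baseline) can be turned into a simple gate. This is a hypothetical helper, not part of Caliper: it assumes you have the pass counts for both runs, and it uses integer arithmetic so that a delta of exactly 20 points is not lost to floating-point rounding.

```python
def meets_bar(skill_passes: int, baseline_passes: int, k: int,
              min_pass_pct: int = 80, min_delta_pct: int = 20) -> bool:
    """Return True if the skill clears both the pass-rate and the delta threshold."""
    if k <= 0:
        raise ValueError("k must be positive")
    pass_ok = 100 * skill_passes >= min_pass_pct * k
    delta_ok = 100 * (skill_passes - baseline_passes) >= min_delta_pct * k
    return pass_ok and delta_ok


print(meets_bar(5, 3, k=5))  # 100% with skill, 60% without
print(meets_bar(5, 5, k=5))  # perfect score, but no gain over baseline
print(meets_bar(3, 1, k=5))  # big gain, but only 60% with skill
```

Only the first case clears both thresholds. The second fails on delta and the third on pass rate, which is why the example rule checks both numbers.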
