gentic.news — AI News Intelligence Platform

Caliper: Run Your Claude Code Skills k Times and Get a pass@k Score That

Caliper gives Claude Code users a pass@k reliability score for skills, with a baseline delta showing if the skill beats the base agent. Install via pipx or npx.

21h ago · 4 min read · AI-Generated
Source: github.com, via hn_claude_code and devto_claudecode (Widely Reported)
How do I test my Claude Code skill reliability with pass@k scoring?

Caliper is a lightweight CLI that runs a Claude Code skill k times in isolated environments, reporting a pass@k score and a delta vs. the base agent. Install with `pipx install caliper-eval`, write a YAML spec, and run `caliper run my-skill.eval.yaml --k 5 --baseline`.

TL;DR

Caliper runs Claude Code skills k times, gives a pass@k score, and shows delta vs. base agent so you know if your skill actually helps.


What Changed — Caliper Brings pass@k Reliability Testing to Claude Code Skills

Skills for Claude Code are non-deterministic. A skill that works on your machine, with your prompt, today, might silently break tomorrow after a model update or a one-line prompt edit. Until now, there was no standard way to catch that.

Caliper is a lightweight, local harness that runs a skill k times in isolated environments and gives you a pass@k score. It answers the question: "How many times did the skill succeed out of k attempts?"

It also includes a --baseline flag that re-runs everything without the skill, so you see the delta — proving whether your skill is actually doing the work, or the base agent would have passed anyway.
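
The arithmetic behind those two numbers can be sketched in a few lines. This is a minimal illustration, assuming each run reduces to a pass/fail boolean and that the score is the share of runs that passed, as the question above suggests. The function names `pass_rate` and `baseline_delta` are hypothetical and not part of Caliper's API.

```python
from typing import Sequence


def pass_rate(outcomes: Sequence[bool]) -> float:
    """Fraction of runs that passed, from 0.0 to 1.0."""
    if not outcomes:
        raise ValueError("need at least one run")
    return sum(outcomes) / len(outcomes)


def baseline_delta(with_skill: Sequence[bool], without_skill: Sequence[bool]) -> float:
    """Pass-rate difference in percentage points; positive means the skill helps."""
    return round((pass_rate(with_skill) - pass_rate(without_skill)) * 100, 1)


# Illustrative outcomes for k=5 (made-up data, not real Caliper results).
with_skill = [True, True, True, True, True]        # 5/5 passed
without_skill = [True, False, True, True, False]   # 3/5 passed

print(f"With skill  {pass_rate(with_skill):.0%}")
print(f"No skill    {pass_rate(without_skill):.0%}")
print(f"Delta       {baseline_delta(with_skill, without_skill):+.0f} points")
```

A delta near zero means the base agent passes about as often on its own, which is exactly the case the baseline run is meant to expose.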

What It Means For You — Concrete Impact on Daily Claude Code Usage

If you publish or maintain Claude Code skills, Caliper replaces guesswork with data. Here's what you get:

  • Track reliability over time. Did your prompt edit actually improve the skill? Run Caliper before and after.
  • Catch regressions. Does it still pass the workflows it passed last week? Caliper saves results to .caliper/results/.
  • Compare agents. Run the same skill on Claude Code, Codex, and Pi — see which agent runs it more reliably.
  • Prove your skill adds value. The delta between "with skill" and "no skill" is your evidence.
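
The before-and-after comparison in the first two bullets could look like the sketch below. The article does not describe the file format Caliper writes to .caliper/results/, so this example takes plain dictionaries of task name to pass rate as input. The `find_regressions` helper and the sample data are hypothetical.

```python
def find_regressions(
    before: dict[str, float],
    after: dict[str, float],
    tolerance: float = 0.0,
) -> dict[str, tuple[float, float]]:
    """Return tasks whose pass rate dropped by more than `tolerance`."""
    regressions: dict[str, tuple[float, float]] = {}
    for task, old_rate in before.items():
        new_rate = after.get(task)
        # Tasks missing from the newer run are skipped rather than guessed at.
        if new_rate is not None and old_rate - new_rate > tolerance:
            regressions[task] = (old_rate, new_rate)
    return regressions


# Made-up pass rates for two runs of the same spec.
last_week = {"task-1": 1.0, "task-2": 0.8}
today = {"task-1": 1.0, "task-2": 0.4}

for task, (old, new) in find_regressions(last_week, today).items():
    print(f"{task}: {old:.0%} -> {new:.0%}")
```

With small k, a single failed run moves the rate a lot, so a non-zero `tolerance` may be a reasonable guard against flagging ordinary run-to-run variance.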

Try It Now — How to Install and Run Caliper

Option 1: Install as a skill (works inside Claude Code)

npx skills@latest add edonadei/caliper

Then, inside Claude Code or Codex, use:

  • /grill-skill ./my-skill/SKILL.md — reads your SKILL.md, interviews you, and writes a 3-task .eval.yaml spec (happy path, edge case, adversarial)
  • /evaluate-skill run my-skill.eval.yaml --k 3 --baseline — runs the evaluation
  • /evaluate-skill list — browse past runs
  • /evaluate-skill report my-skill — view a report

Option 2: Install as a standalone CLI

pipx install caliper-eval  # requires Python 3.10+

Write a YAML spec:

# my-skill.eval.yaml
skill:
  path: ./SKILL.md
  backend: claude-code
judge:
  backend: claude-code
tasks:
  - name: Writes a conventional commit message
    prompt: "Summarize the staged git diff as a commit message."
    expect: >
      The response is a conventional-commit message: a concise subject line under 72 characters, followed by a body explaining why the change was made.

  - name: Generates a valid config file
    cleanup: rm -f /tmp/app.config.json
    prompt: "Generate a config at /tmp/app.config.json with a 'port' of 8080."
    assert: |
      import json
      from pathlib import Path
      data = json.loads(Path("/tmp/app.config.json").read_text())
      assert data["port"] == 8080
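
Before spending agent runs on a spec, it can help to dry-run an `assert` check by hand. The sketch below wraps the same two assertions from the second task in a function and exercises them against files it writes itself. It says nothing about how Caliper executes the snippet internally; the `check_config` name and the temporary paths are assumptions for illustration.

```python
import json
import tempfile
from pathlib import Path


def check_config(path: Path) -> None:
    """Same check as the spec: the file parses as JSON and 'port' equals 8080."""
    data = json.loads(path.read_text())
    assert data["port"] == 8080


with tempfile.TemporaryDirectory() as tmp:
    good = Path(tmp) / "app.config.json"
    good.write_text(json.dumps({"port": 8080}))
    check_config(good)  # passes silently

    bad = Path(tmp) / "bad.config.json"
    bad.write_text(json.dumps({"port": "8080"}))  # string instead of integer
    try:
        check_config(bad)
    except AssertionError:
        print("string port rejected")
```

The second case shows why a code assertion is stricter than a prose expectation: a port written as the string "8080" fails the check, even though a reader might judge the file correct.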

Run it:

caliper run my-skill.eval.yaml --k 5 --baseline

Output example:

CALIPER - my-skill - k=5 - claude-code
ID      Task                           k(5)  pass@k
task-1  Extracts action items as JSON  5/5   100%  PASS

With skill   100%
No skill      60%
Delta        +40%

The Eval Starter Pack

Caliper includes four copy-paste templates that catch real agent failures: false success, tool misuse, run-to-run variance, and instruction drift. These are available in the project's GitHub repo.

Why This Matters for Claude Code Users

Skills are how you extend Claude Code's capabilities. But without testing, you're shipping blind. Caliper gives you a pass@k score you can track, compare, and cite when you tell your team "this skill works."

It's also the first tool to surface the delta — how much better the skill performs than the base agent. Sometimes that delta is 0%. Sometimes it's -100%. Now you'll know before your users do.



[Updated 29 Jun via hn_claude_code]

Caliper now supports running skills on multiple backends including Claude Code, Codex, Pi, Claude API, and OpenAI API, with the ability to use separate backends for the agent and the judge [per Show HN]. The project also introduces two new companion skills: evaluate-skill for managing evals within your workflow, and grill-skill that reads your SKILL.md, interviews you, and auto-generates a 3-task evaluation spec covering happy path, edge case, and adversarial scenarios.

Sources cited in this article

  1. Brings
  2. Show HN

AI-assisted reporting. Generated by gentic.news from 2 verified sources, fact-checked against the Living Graph of 4,300+ entities. Edited by Ala SMITH.


AI Analysis

Here's what you should do differently starting today:

  • **Before publishing a skill**, run `caliper run my-skill.eval.yaml --k 5 --baseline`. If the delta is small or negative, your skill isn't adding value, so iterate on the prompt or instructions. If the pass@k is below 80%, your skill is unreliable. Fix it before sharing.
  • **After any model update** (e.g., Claude Opus 4.6 to 4.8), re-run your evals. Model updates can silently break skills. Caliper makes regression detection trivial: run `caliper run` again and compare the saved results in `.caliper/results/`.
  • **Use the /grill-skill command** to generate specs automatically. It reads your SKILL.md, interviews you about what "good" looks like, and writes a 3-task spec covering happy path, edge case, and adversarial inputs. This lowers the barrier to writing meaningful tests.
  • **Track the delta over time.** If your delta drops from +40% to +10%, something changed: maybe the base model got better, or your skill regressed. Either way, you catch it before users complain.
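
The publishing rule above can be written down as a small gate. This is a sketch only: the 80% floor comes from the advice above, while `release_gate`, `min_delta_points`, and the choice to treat a delta of zero as "not adding value" are assumptions made for illustration, not Caliper features.

```python
def release_gate(
    pass_rate: float,
    delta_points: float,
    min_pass_rate: float = 0.80,
    min_delta_points: float = 0.0,
) -> list[str]:
    """Return the reasons a skill should not ship yet; an empty list means it can."""
    problems: list[str] = []
    if pass_rate < min_pass_rate:
        problems.append(
            f"pass rate {pass_rate:.0%} is below the {min_pass_rate:.0%} floor"
        )
    if delta_points <= min_delta_points:
        problems.append(
            f"delta of {delta_points:+.0f} points is at or below "
            f"the {min_delta_points:+.0f} point threshold"
        )
    return problems


print(release_gate(pass_rate=1.0, delta_points=40.0))   # no problems
print(release_gate(pass_rate=0.6, delta_points=-10.0))  # two problems
```

Raising `min_delta_points` is one way to make "small" concrete for a team, since the advice above leaves that threshold open.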