Caliper: Run Your Claude Code Skills k Times and Get a pass@k Score That

Caliper gives Claude Code users a pass@k reliability score for skills, with a baseline delta showing if the skill beats the base agent. Install via pipx or npx.

Ala SMITH & AI Research Desk · 21h ago · 4 min read · AI-Generated

Source: github.com via hn_claude_code, devto_claudecode · Widely Reported

How do I test my Claude Code skill reliability with pass@k scoring?

Caliper is a lightweight CLI that runs a Claude Code skill k times in isolated environments, reporting a pass@k score and a delta vs. the base agent. Install with `pipx install caliper-eval`, write a YAML spec, and run `caliper run my-skill.eval.yaml --k 5 --baseline`.

TL;DR

Caliper runs Claude Code skills k times, gives a pass@k score, and shows delta vs. base agent so you know if your skill actually helps.

What Changed — Caliper Brings pass@k Reliability Testing to Claude Code Skills

Skills for Claude Code are non-deterministic. A skill that works on your machine, with your prompt, today, might silently break tomorrow after a model update or a one-line prompt edit. Until now, there was no standard way to catch that.

Caliper is a lightweight, local harness that runs a skill k times in isolated environments and gives you a pass@k score. It answers the question: "How many times did the skill succeed out of k attempts?"

It also includes a --baseline flag that re-runs everything without the skill, so you see the delta — proving whether your skill is actually doing the work, or the base agent would have passed anyway.
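The arithmetic behind the score and the delta can be sketched in a few lines of Python. This is an illustration of the definition given above (successes out of k attempts), not Caliper's actual code. The function names `pass_rate` and `baseline_delta` are made up for this example, and the outcome lists are hypothetical values chosen to match the sample output shown later in the article.

```python
from typing import Sequence


def pass_rate(outcomes: Sequence[bool]) -> float:
    """Fraction of the k attempts that succeeded."""
    if not outcomes:
        raise ValueError("need at least one attempt")
    return sum(outcomes) / len(outcomes)


def baseline_delta(with_skill: Sequence[bool], no_skill: Sequence[bool]) -> float:
    """Difference in pass rate between the two runs, in percentage points."""
    return 100.0 * (pass_rate(with_skill) - pass_rate(no_skill))


# Hypothetical outcomes for k=5 (not real Caliper results).
with_skill = [True, True, True, True, True]
no_skill = [True, False, True, True, False]

print(f"With skill {pass_rate(with_skill):.0%}")
print(f"No skill   {pass_rate(no_skill):.0%}")
print(f"Delta      {baseline_delta(with_skill, no_skill):+.0f} points")
```

A delta near zero means the base agent passes the task about as often without the skill, which is the case the baseline run is meant to expose.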

What It Means For You — Concrete Impact on Daily Claude Code Usage

If you publish or maintain Claude Code skills, Caliper replaces guesswork with data. Here's what you get:

Track reliability over time. Did your prompt edit actually improve the skill? Run Caliper before and after.
Catch regressions. Does it still pass the workflows it passed last week? Caliper saves results to .caliper/results/.
Compare agents. Run the same skill on Claude Code, Codex, and Pi — see which agent runs it more reliably.
Prove your skill adds value. The delta between "with skill" and "no skill" is your evidence.
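As a rough illustration of the regression check described above, the sketch below compares per-task pass rates from two runs. The article does not describe the file format Caliper uses under .caliper/results/, so this example assumes you have already loaded each run into a plain dict of task name to pass rate. The helper `find_regressions` and the task names are hypothetical.

```python
from typing import Dict, List


def find_regressions(
    previous: Dict[str, float],
    current: Dict[str, float],
    tolerance: float = 0.0,
) -> List[str]:
    """Return task names whose pass rate dropped by more than `tolerance`."""
    regressed = []
    for task, old_rate in previous.items():
        new_rate = current.get(task)
        if new_rate is None:
            continue  # task is no longer in the spec, nothing to compare
        if old_rate - new_rate > tolerance:
            regressed.append(task)
    return sorted(regressed)


# Hypothetical pass rates from two runs (assumed structure, not Caliper's format).
last_week = {"commit-message": 1.0, "config-file": 0.8}
today = {"commit-message": 1.0, "config-file": 0.4}

print(find_regressions(last_week, today))
```

With small k, a single flaky attempt moves the pass rate a lot, so a nonzero `tolerance` may be worth setting before treating a drop as a real regression.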

Try It Now — How to Install and Run Caliper

Option 1: Install as a skill (works inside Claude Code)

npx skills@latest add edonadei/caliper

Then, inside Claude Code or Codex, use:

/grill-skill ./my-skill/SKILL.md — reads your SKILL.md, interviews you, and writes a 3-task .eval.yaml spec (happy path, edge case, adversarial)
/evaluate-skill run my-skill.eval.yaml --k 3 --baseline — runs the evaluation
/evaluate-skill list — browse past runs
/evaluate-skill report my-skill — view a report

Option 2: Install as a standalone CLI

pipx install caliper-eval  # requires Python 3.10+

Write a YAML spec:

# my-skill.eval.yaml
skill:
  path: ./SKILL.md
  backend: claude-code
judge:
  backend: claude-code
tasks:
  - name: Writes a conventional commit message
    prompt: "Summarize the staged git diff as a commit message."
    expect: >
      The response is a conventional-commit message: a concise subject line under 72 characters, followed by a body explaining why the change was made.

  - name: Generates a valid config file
    cleanup: rm -f /tmp/app.config.json
    prompt: "Generate a config at /tmp/app.config.json with a 'port' of 8080."
    assert: |
      import json
      from pathlib import Path
      data = json.loads(Path("/tmp/app.config.json").read_text())
      assert data["port"] == 8080
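To make the `assert` task type more concrete, here is one way a Python check like the one in the spec could be evaluated: execute the snippet and count the task as passed if nothing is raised. This is an assumption about the general approach, not a description of how Caliper runs checks internally. The helper `run_assert_check` is hypothetical, and the example writes to a temporary directory instead of /tmp/app.config.json.

```python
import json
import tempfile
from pathlib import Path


def run_assert_check(snippet: str) -> bool:
    """Execute a Python snippet; the check passes if no exception is raised."""
    try:
        exec(snippet, {"__name__": "__check__"})
    except Exception:
        # AssertionError, a missing file, or invalid JSON all count as a failure.
        return False
    return True


# Stand-in for the file the agent is asked to create.
config_path = Path(tempfile.mkdtemp()) / "app.config.json"
config_path.write_text(json.dumps({"port": 8080}))

check = f"""
import json
from pathlib import Path
data = json.loads(Path({str(config_path)!r}).read_text())
assert data["port"] == 8080
"""

print(run_assert_check(check))  # file has port 8080

config_path.write_text(json.dumps({"port": 9090}))
print(run_assert_check(check))  # file now has the wrong port
```

Because the snippet is executed as code, a check written this way should only be run on specs you trust.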

Run it:

caliper run my-skill.eval.yaml --k 5 --baseline

Output example:

CALIPER - my-skill - k=5 - claude-code
ID      Task                           k(5)  pass@k
task-1  Extracts action items as JSON  5/5   100%  PASS

With skill   100%
No skill      60%
Delta        +40%

The Eval Starter Pack

Caliper includes four copy-paste templates that catch real agent failures: false success, tool misuse, run-to-run variance, and instruction drift. These are available in the project's GitHub repo.

Why This Matters for Claude Code Users

Skills are how you extend Claude Code's capabilities. But without testing, you're shipping blind. Caliper gives you a pass@k score you can track, compare, and cite when you tell your team "this skill works."

It's also the first tool to surface the delta — how much better the skill performs than the base agent. Sometimes that delta is 0%. Sometimes it's -100%. Now you'll know before your users do.

[Updated 29 Jun via hn_claude_code]

Caliper now supports running skills on multiple backends including Claude Code, Codex, Pi, Claude API, and OpenAI API, with the ability to use separate backends for the agent and the judge [per Show HN]. The project also introduces two new companion skills: evaluate-skill for managing evals within your workflow, and grill-skill that reads your SKILL.md, interviews you, and auto-generates a 3-task evaluation spec covering happy path, edge case, and adversarial scenarios.

Sources cited in this article

Brings
Show HN

Source: gentic.news · 21h ago · author=Ala SMITH · citation.json

AI-assisted reporting. Generated by gentic.news from 2 verified sources, fact-checked against the Living Graph of 4,300+ entities. Edited by Ala SMITH.

AI Analysis

Here's what you should do differently starting today:

**Before publishing a skill**, run `caliper run my-skill.eval.yaml --k 5 --baseline`. If the delta is small or negative, your skill isn't adding value, so iterate on the prompt or instructions. If the pass@k is below 80%, your skill is unreliable. Fix it before sharing.

**After any model update** (e.g., Claude Opus 4.6 to 4.8), re-run your evals. Model updates can silently break skills. Caliper makes regression detection trivial: just run `caliper run` again and compare the saved results in `.caliper/results/`.

**Use the /grill-skill command** to generate specs automatically. It reads your SKILL.md, interviews you about what "good" looks like, and writes a 3-task spec covering happy path, edge case, and adversarial inputs. This lowers the barrier to writing meaningful tests.

**Track the delta over time.** If your delta drops from +40% to +10%, something changed: maybe the base model got better, or your skill regressed. Either way, you catch it before users complain.
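The pre-publish rule above can be written down as a small gate. The sketch below is only an illustration: the 80% pass threshold comes from the analysis, but the article never says how small a "small" delta is, so the 5-point cutoff is an assumption you should tune. The function name `publish_blockers` is hypothetical and is not part of Caliper.

```python
from typing import List


def publish_blockers(
    pass_rate: float,
    delta_points: float,
    min_pass_rate: float = 0.80,    # threshold named in the analysis
    min_delta_points: float = 5.0,  # assumed cutoff for a "small" delta
) -> List[str]:
    """Return reasons to hold a skill back; an empty list means no blockers."""
    blockers = []
    if pass_rate < min_pass_rate:
        blockers.append(f"pass rate {pass_rate:.0%} is below {min_pass_rate:.0%}")
    if delta_points < min_delta_points:
        blockers.append(
            f"delta {delta_points:+.0f} points is under {min_delta_points:.0f} points"
        )
    return blockers


# Hypothetical scores: a healthy skill, then one that fails both checks.
print(publish_blockers(pass_rate=1.0, delta_points=40.0))
print(publish_blockers(pass_rate=0.6, delta_points=0.0))
```

Returning the reasons instead of a bare boolean makes the gate easier to use in a CI job, where the failure message is what a reviewer reads first.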
