How to Run 60 Code Experiments Overnight with Claude Code's Autoresearch Skill

A developer open-sourced a Claude Code skill that autonomously runs experiments on your codebase, proving what doesn't work is as valuable as what does.

Ggentic.news Editorial·9h ago·3 min read·7 views·via reddit_claude

A developer recently tested an "autoresearch" approach on a production Django search system using Claude Code. The results were surprising: 60 iterations, 57 reverted changes, and only 3 kept. But that 95% "failure rate" was the point.

The Technique: Autonomous Experimentation

The developer adapted Andrej Karpathy's autoresearch concept—previously used for ML training—to a real production codebase. Using a custom Claude Code skill, they set up an automated system where Claude would:

  1. Propose a hypothesis about improving search performance
  2. Implement the change in the Django/pgvector/Cohere system
  3. Run tests to measure impact
  4. Analyze results and decide whether to keep or revert

All while the developer was away from the keyboard.
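The loop above can be sketched in a few lines of Python. Everything here is an illustrative reconstruction, not the skill's actual API: the function names, the injected callables, and the simple keep-if-better rule are assumptions made so the shape of the technique is visible.

```python
def autoresearch_loop(propose, apply_change, revert_change, measure, iterations=60):
    """Propose -> implement -> test -> keep-or-revert, as in the steps above.

    All behavior is injected so the loop stays generic and testable:
      propose(i)       -> a hypothesis for iteration i
      apply_change(h)  -> implement hypothesis h in the codebase
      revert_change(h) -> roll h back (e.g. via version control)
      measure()        -> run the eval suite, return a score (higher is better)
    """
    baseline = measure()            # score before any changes
    kept = []
    for i in range(iterations):
        hypothesis = propose(i)
        apply_change(hypothesis)
        score = measure()
        if score > baseline:        # measurable win: keep it and raise the bar
            baseline = score
            kept.append(hypothesis)
        else:                       # no improvement: revert, like 57 of the 60 runs
            revert_change(hypothesis)
    return kept, baseline
```

Because the keep/revert decision is a strict comparison against the current baseline, ambiguous "maybe it helped?" outcomes default to a revert, which is what makes the dead-end data trustworthy.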

Why It Works: Mapping the Ceiling

The key insight isn't about finding improvements. It's about efficiently discovering what doesn't work. In traditional development, you might:

  • Spend hours manually testing an idea
  • Get ambiguous results
  • Wonder if you should try "just one more tweak"

With autoresearch, Claude systematically explores the solution space, providing concrete data on what's worth pursuing and what's a dead end.

In this case, the system proved:

  • Title matching as a search signal? Net negative (proved in 2 iterations)
  • Larger candidate pools? No effect (problem was ranking, not recall)
  • Fiddling with keyword damping formulas? Scores barely moved

"You can stop tuning this" becomes a data-driven decision, not a gut feeling.

How To Apply It: The Open-Source Skill

The developer open-sourced their implementation: github.com/pjhoberman/autoresearch

To use it with Claude Code:

# Clone the skill repository into Claude Code's skills directory
# (creating it first if it doesn't exist yet)
mkdir -p ~/.claude-code/skills
cd ~/.claude-code/skills
git clone https://github.com/pjhoberman/autoresearch.git

# Configure your experiment
cat > experiment_config.yaml << EOF
objective: "Improve search relevance scores"
metric: "ndcg@10"  # Your evaluation metric
max_iterations: 30
change_types:
  - "parameter_tuning"
  - "algorithm_modification"
  - "feature_addition"
EOF

# Run with Claude Code
claude code --skill autoresearch --config experiment_config.yaml
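The config above names ndcg@10 as its evaluation metric. For readers unfamiliar with it, NDCG@k rewards ranking relevant results near the top; a minimal implementation of the standard formula (independent of the skill, included only to make the metric concrete) looks like this:

```python
import math

def ndcg_at_k(relevances, k=10):
    """NDCG@k for one ranked result list.

    `relevances` are graded relevance labels in the order the system
    ranked them; the score is the DCG of that order divided by the DCG
    of the ideal (best-possible) order, so 1.0 means a perfect ranking.
    """
    def dcg(rels):
        # Each position's gain is discounted logarithmically by rank.
        return sum(rel / math.log2(rank + 2) for rank, rel in enumerate(rels))

    actual = dcg(relevances[:k])
    ideal = dcg(sorted(relevances, reverse=True)[:k])
    return actual / ideal if ideal > 0 else 0.0
```

A metric like this is a good fit for autoresearch precisely because it yields one comparable number per run, which the keep-or-revert decision needs.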

The skill works by:

  1. Reading your codebase and test suite
  2. Understanding your performance metrics
  3. Proposing targeted, incremental changes
  4. Running your tests automatically
  5. Learning from each iteration to propose better changes
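Step 5, learning from each iteration, could be as simple as an append-only log of outcomes that gets summarized into the next proposal prompt. This is a hypothetical sketch of that idea; the file name, record format, and summary wording are assumptions, not the skill's real storage format:

```python
import json
from pathlib import Path

LOG = Path("experiments.jsonl")  # hypothetical log location

def record(hypothesis: str, score: float, kept: bool) -> None:
    """Append one iteration's outcome as a JSON line."""
    with LOG.open("a") as f:
        f.write(json.dumps({"hypothesis": hypothesis,
                            "score": score, "kept": kept}) + "\n")

def summarize() -> str:
    """Condense the history into a prompt fragment for the next proposal,
    so already-reverted dead ends aren't retried."""
    entries = [json.loads(line) for line in LOG.read_text().splitlines()]
    dead_ends = [e["hypothesis"] for e in entries if not e["kept"]]
    return "Already tried and reverted: " + "; ".join(dead_ends)
```

Feeding a summary like this back into each proposal is what turns 60 independent runs into a search that narrows over time.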

Beyond ML: Production Code Applications

While originally designed for ML training, this approach works for any codebase with:

  • Clear metrics (performance scores, response times, error rates)
  • Automated tests to validate changes
  • Version control for safe rollbacks

The developer discovered a Redis caching bug during their experiments: keys were hashed on query text but not prompt variations. This would have shipped to production unnoticed without systematic testing.
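The bug class is easy to reconstruct in miniature. The code below is an illustrative reconstruction, not the actual production snippet: if the cache key hashes only the query text, two different prompt variants collide and silently serve each other's cached results.

```python
import hashlib

def buggy_cache_key(query: str, prompt_variant: str) -> str:
    """The bug class described above: the key covers only the query text,
    so requests with different prompt variants collide."""
    return hashlib.sha256(query.encode()).hexdigest()

def fixed_cache_key(query: str, prompt_variant: str) -> str:
    """The fix: hash every input that can change the cached result."""
    return hashlib.sha256(f"{prompt_variant}\x00{query}".encode()).hexdigest()
```

Bugs like this evade casual testing because each variant works in isolation; only systematic runs across variants expose the collision.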

When To Use Autoresearch

Consider this approach when:

  • You're optimizing a system with multiple tunable parameters
  • You suspect there are diminishing returns to manual tuning
  • You want to validate architectural decisions with data
  • You need to explore a solution space faster than manual iteration allows

The Real Value: Knowledge, Not Just Improvements

The marginal score improvement was +0.03. The real value was learning:

  • Which optimization avenues are dead ends
  • Which hand-built components actually work (validating intuition)
  • Where the true performance ceiling lies

As the developer noted: "Autoresearch maps where the ceiling is, not just the improvements." For Claude Code users, this means turning optimization from an art into a science—and getting concrete data on when to stop tweaking and start building something new.

AI Analysis

Claude Code users should explore the autoresearch skill for systematic optimization work. Instead of manually testing hypotheses about performance improvements, set up the skill to run overnight experiments. This is particularly valuable for:

  1. Performance tuning: Set clear metrics (response time, throughput, accuracy) and let Claude explore parameter spaces you wouldn't have time to test manually.
  2. Architecture validation: When debating between implementation approaches, create A/B test scenarios and let the autoresearch skill gather data on which performs better under various conditions.
  3. Bug hunting: The skill discovered a Redis caching bug because it systematically tested edge cases. Configure it to explore different usage patterns to uncover hidden issues.

The key workflow change: stop guessing about optimizations. Instead, create a test harness with clear metrics, then delegate the exploration to Claude Code with the autoresearch skill. You'll get data-driven answers about what actually matters in your codebase.
Original source: reddit.com
