How to Run 60 Code Experiments Overnight with Claude Code's Autoresearch Skill

A developer open-sourced a Claude Code skill that autonomously runs experiments on your codebase, proving what doesn't work is as valuable as what does.

Ggentic.news Editorial·9h ago·3 min read·7 views·via reddit_claude

A developer recently tested an "autoresearch" approach on a production Django search system using Claude Code. The results were surprising: 60 iterations, 57 reverted changes, and only 3 kept. But that 95% "failure rate" was the point.

The Technique: Autonomous Experimentation

The developer adapted Andrej Karpathy's autoresearch concept—previously used for ML training—to a real production codebase. Using a custom Claude Code skill, they set up an automated system where Claude would:

  1. Propose a hypothesis about improving search performance
  2. Implement the change in the Django/pgvector/Cohere system
  3. Run tests to measure impact
  4. Analyze results and decide whether to keep or revert

All while the developer was away from the keyboard.
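The loop above can be sketched in a few lines of Python. Everything here is an illustrative reconstruction, not the skill's actual API: the function names, the injected callables, and the simple keep-if-better rule are assumptions made so the shape of the technique is visible.

```python
def autoresearch_loop(propose, apply_change, revert_change, measure, iterations=60):
    """Propose -> implement -> test -> keep-or-revert, as in the steps above.

    All behavior is injected so the loop stays generic and testable:
      propose(i)       -> a hypothesis for iteration i
      apply_change(h)  -> implement hypothesis h in the codebase
      revert_change(h) -> roll h back (e.g. via version control)
      measure()        -> run the eval suite, return a score (higher is better)
    """
    baseline = measure()            # score before any changes
    kept = []
    for i in range(iterations):
        hypothesis = propose(i)
        apply_change(hypothesis)
        score = measure()
        if score > baseline:        # measurable win: keep it and raise the bar
            baseline = score
            kept.append(hypothesis)
        else:                       # no improvement: revert, like 57 of the 60 runs
            revert_change(hypothesis)
    return kept, baseline
```

Because the keep/revert decision is a strict comparison against the current baseline, ambiguous "maybe it helped?" outcomes default to a revert, which is what makes the dead-end data trustworthy.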

Why It Works: Mapping the Ceiling

The key insight isn't about finding improvements. It's about efficiently discovering what doesn't work. In traditional development, you might:

  • Spend hours manually testing an idea
  • Get ambiguous results
  • Wonder if you should try "just one more tweak"

With autoresearch, Claude systematically explores the solution space, providing concrete data on what's worth pursuing and what's a dead end.

In this case, the system proved:

  • Title matching as a search signal? Net negative (proved in 2 iterations)
  • Larger candidate pools? No effect (problem was ranking, not recall)
  • Fiddling with keyword damping formulas? Scores barely moved

"You can stop tuning this" becomes a data-driven decision, not a gut feeling.

How To Apply It: The Open-Source Skill

The developer open-sourced their implementation: github.com/pjhoberman/autoresearch

To use it with Claude Code:

# Clone the skill repository into Claude Code's skills directory
# (creating it first if it doesn't exist yet)
mkdir -p ~/.claude-code/skills
cd ~/.claude-code/skills
git clone https://github.com/pjhoberman/autoresearch.git

# Configure your experiment
cat > experiment_config.yaml << EOF
objective: "Improve search relevance scores"
metric: "ndcg@10"  # Your evaluation metric
max_iterations: 30
change_types:
  - "parameter_tuning"
  - "algorithm_modification"
  - "feature_addition"
EOF

# Run with Claude Code
claude code --skill autoresearch --config experiment_config.yaml
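The config above names ndcg@10 as its evaluation metric. For readers unfamiliar with it, NDCG@k rewards ranking relevant results near the top; a minimal implementation of the standard formula (independent of the skill, included only to make the metric concrete) looks like this:

```python
import math

def ndcg_at_k(relevances, k=10):
    """NDCG@k for one ranked result list.

    `relevances` are graded relevance labels in the order the system
    ranked them; the score is the DCG of that order divided by the DCG
    of the ideal (best-possible) order, so 1.0 means a perfect ranking.
    """
    def dcg(rels):
        # Each position's gain is discounted logarithmically by rank.
        return sum(rel / math.log2(rank + 2) for rank, rel in enumerate(rels))

    actual = dcg(relevances[:k])
    ideal = dcg(sorted(relevances, reverse=True)[:k])
    return actual / ideal if ideal > 0 else 0.0
```

A metric like this is a good fit for autoresearch precisely because it yields one comparable number per run, which the keep-or-revert decision needs.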

The skill works by:

  1. Reading your codebase and test suite
  2. Understanding your performance metrics
  3. Proposing targeted, incremental changes
  4. Running your tests automatically
  5. Learning from each iteration to propose better changes
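Step 5, learning from each iteration, could be as simple as an append-only log of outcomes that gets summarized into the next proposal prompt. This is a hypothetical sketch of that idea; the file name, record format, and summary wording are assumptions, not the skill's real storage format:

```python
import json
from pathlib import Path

LOG = Path("experiments.jsonl")  # hypothetical log location

def record(hypothesis: str, score: float, kept: bool) -> None:
    """Append one iteration's outcome as a JSON line."""
    with LOG.open("a") as f:
        f.write(json.dumps({"hypothesis": hypothesis,
                            "score": score, "kept": kept}) + "\n")

def summarize() -> str:
    """Condense the history into a prompt fragment for the next proposal,
    so already-reverted dead ends aren't retried."""
    entries = [json.loads(line) for line in LOG.read_text().splitlines()]
    dead_ends = [e["hypothesis"] for e in entries if not e["kept"]]
    return "Already tried and reverted: " + "; ".join(dead_ends)
```

Feeding a summary like this back into each proposal is what turns 60 independent runs into a search that narrows over time.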

Beyond ML: Production Code Applications

While originally designed for ML training, this approach works for any codebase with:

  • Clear metrics (performance scores, response times, error rates)
  • Automated tests to validate changes
  • Version control for safe rollbacks

The developer discovered a Redis caching bug during their experiments: keys were hashed on query text but not prompt variations. This would have shipped to production unnoticed without systematic testing.
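The bug class is easy to reconstruct in miniature. The code below is an illustrative reconstruction, not the actual production snippet: if the cache key hashes only the query text, two different prompt variants collide and silently serve each other's cached results.

```python
import hashlib

def buggy_cache_key(query: str, prompt_variant: str) -> str:
    """The bug class described above: the key covers only the query text,
    so requests with different prompt variants collide."""
    return hashlib.sha256(query.encode()).hexdigest()

def fixed_cache_key(query: str, prompt_variant: str) -> str:
    """The fix: hash every input that can change the cached result."""
    return hashlib.sha256(f"{prompt_variant}\x00{query}".encode()).hexdigest()
```

Bugs like this evade casual testing because each variant works in isolation; only systematic runs across variants expose the collision.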

When To Use Autoresearch

Consider this approach when:

  • You're optimizing a system with multiple tunable parameters
  • You suspect there are diminishing returns to manual tuning
  • You want to validate architectural decisions with data
  • You need to explore a solution space faster than manual iteration allows

The Real Value: Knowledge, Not Just Improvements

The marginal score improvement was +0.03. The real value was learning:

  • Which optimization avenues are dead ends
  • Which hand-built components actually work (validating intuition)
  • Where the true performance ceiling lies

As the developer noted: "Autoresearch maps where the ceiling is, not just the improvements." For Claude Code users, this means turning optimization from an art into a science—and getting concrete data on when to stop tweaking and start building something new.

AI Analysis

Claude Code users should explore the autoresearch skill for systematic optimization work. Instead of manually testing hypotheses about performance improvements, set up the skill to run overnight experiments. This is particularly valuable for:

  1. Performance tuning: Set clear metrics (response time, throughput, accuracy) and let Claude explore parameter spaces you wouldn't have time to test manually.
  2. Architecture validation: When debating between implementation approaches, create A/B test scenarios and let the autoresearch skill gather data on which performs better under various conditions.
  3. Bug hunting: The skill discovered a Redis caching bug because it systematically tested edge cases. Configure it to explore different usage patterns to uncover hidden issues.

The key workflow change: stop guessing about optimizations. Instead, create a test harness with clear metrics, then delegate the exploration to Claude Code with the autoresearch skill. You'll get data-driven answers about what actually matters in your codebase.
Original source: reddit.com
