How to Run 60 Code Experiments Overnight with Claude Code's Autoresearch Skill
A developer recently tested an "autoresearch" approach on a production Django search system using Claude Code. The results were surprising: 60 iterations, 57 reverted changes, and only 3 kept. But that 95% "failure rate" was the point.
The Technique: Autonomous Experimentation
The developer adapted Andrej Karpathy's autoresearch concept—previously used for ML training—to a real production codebase. Using a custom Claude Code skill, they set up an automated system where Claude would:
- Propose a hypothesis about improving search performance
- Implement the change in the Django/pgvector/Cohere system
- Run tests to measure impact
- Analyze results and decide whether to keep or revert
All while the developer was away from the keyboard.
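The keep-or-revert loop above can be sketched in a few lines of Python. This is a minimal illustration, not the skill's actual implementation; the `propose`, `apply_change`, `revert`, and `evaluate` callbacks are hypothetical stand-ins for Claude's hypothesis generation, code edits, git rollback, and test runs:

```python
def autoresearch(propose, apply_change, revert, evaluate, iterations=60):
    """Illustrative keep-or-revert experiment loop (not the skill's real code).

    propose(kept)     -> a hypothesis, informed by what has worked so far
    apply_change(h)   -> implements the hypothesis in the codebase
    evaluate()        -> runs the test suite, returns the metric (higher is better)
    revert()          -> rolls the change back, e.g. `git checkout -- .`
    """
    baseline = evaluate()
    kept = []
    for _ in range(iterations):
        hypothesis = propose(kept)   # 1. propose a hypothesis
        apply_change(hypothesis)     # 2. implement the change
        score = evaluate()           # 3. run tests, measure impact
        if score > baseline:         # 4. keep only clear wins...
            baseline = score
            kept.append(hypothesis)
        else:
            revert()                 # ...revert everything else
    return kept, baseline
```

The asymmetry is deliberate: most iterations end in `revert()`, and that is still useful output, because each reverted change is a hypothesis eliminated.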
Why It Works: Mapping the Ceiling
The key insight isn't about finding improvements—it's about efficiently discovering what doesn't work. In traditional development, you might:
- Spend hours manually testing an idea
- Get ambiguous results
- Wonder if you should try "just one more tweak"
With autoresearch, Claude systematically explores the solution space, providing concrete data on what's worth pursuing and what's a dead end.
In this case, the system proved:
- Title matching as a search signal? Net negative (proved in 2 iterations)
- Larger candidate pools? No effect (problem was ranking, not recall)
- Fiddling with keyword damping formulas? Scores barely moved
"You can stop tuning this" becomes a data-driven decision, not a gut feeling.
How To Apply It: The Open-Source Skill
The developer open-sourced their implementation: github.com/pjhoberman/autoresearch
To use it with Claude Code:
```shell
# Clone the skill repository into Claude Code's skills directory
mkdir -p ~/.claude/skills && cd ~/.claude/skills
git clone https://github.com/pjhoberman/autoresearch.git

# Configure your experiment
cat > experiment_config.yaml << 'EOF'
objective: "Improve search relevance scores"
metric: "ndcg@10"  # your evaluation metric
max_iterations: 30
change_types:
  - "parameter_tuning"
  - "algorithm_modification"
  - "feature_addition"
EOF

# Start Claude Code and invoke the skill
# (the exact invocation depends on how the skill defines itself)
claude "Use the autoresearch skill with experiment_config.yaml"
```
The skill works by:
- Reading your codebase and test suite
- Understanding your performance metrics
- Proposing targeted, incremental changes
- Running your tests automatically
- Learning from each iteration to propose better changes
Beyond ML: Production Code Applications
While originally designed for ML training, this approach works for any codebase with:
- Clear metrics (performance scores, response times, error rates)
- Automated tests to validate changes
- Version control for safe rollbacks
The developer discovered a Redis caching bug during their experiments: cache keys were hashed on query text alone, ignoring prompt variations. Without systematic testing, it would have shipped to production unnoticed.
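That class of bug is easy to reproduce in isolation. The sketch below is illustrative, not the developer's actual code: the buggy key hashes only the query text, so two different prompt variants collide on the same cache entry, while the fixed key folds every result-affecting input into the hash.

```python
import hashlib

def cache_key_buggy(query: str, prompt_variant: str) -> str:
    # Bug: only the query text is hashed, so results produced with
    # different prompt variants silently share one cache entry.
    return "search:" + hashlib.sha256(query.encode()).hexdigest()

def cache_key_fixed(query: str, prompt_variant: str) -> str:
    # Fix: include every input that can change the result in the key.
    # The NUL separator prevents ambiguous concatenations.
    payload = f"{prompt_variant}\x00{query}".encode()
    return "search:" + hashlib.sha256(payload).hexdigest()
```

The general rule: a cache key must be a function of every input that affects the cached value, or stale or cross-contaminated results will leak through.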
When To Use Autoresearch
Consider this approach when:
- You're optimizing a system with multiple tunable parameters
- You suspect there are diminishing returns to manual tuning
- You want to validate architectural decisions with data
- You need to explore a solution space faster than manual iteration allows
The Real Value: Knowledge, Not Just Improvements
The marginal score improvement was +0.03. The real value was learning:
- Which optimization avenues are dead ends
- Which hand-built components actually work (validating intuition)
- Where the true performance ceiling lies
As the developer noted: "Autoresearch maps where the ceiling is, not just the improvements." For Claude Code users, this means turning optimization from an art into a science—and getting concrete data on when to stop tweaking and start building something new.