Nick Saraev demonstrates how to combine Claude Code skills with Andrej Karpathy's autoresearch pattern to dramatically improve skill reliability. The core insight: instead of manually fixing skills when they fail, you set up an automated optimization loop that uses evals to measure quality and iteratively improves the skill prompt overnight.
Derived from Karpathy's autoresearch GitHub repo, this pattern uses three ingredients: the thing being optimized (your skill.md), evaluation criteria (how to measure if the output is good), and an optimization agent that iterates between running the skill and improving it based on eval results.
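The three ingredients can be sketched as a small data structure. This is an illustrative shape only, not code from the autoresearch repo; the names `skill_prompt`, `evaluate`, and `improve` are hypothetical:

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class OptimizationSetup:
    # Ingredient 1: the thing being optimized (e.g. the contents of skill.md)
    skill_prompt: str
    # Ingredient 2: the eval — scores an output, higher is better
    evaluate: Callable[[str], float]
    # Ingredient 3: the optimization agent — given the current prompt and
    # its score, proposes an improved prompt
    improve: Callable[[str, float], str]
```

Keeping the eval as a plain callable makes it easy to swap criteria without touching the loop itself.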
The eval is the most critical component. It defines what "good output" looks like in concrete, measurable terms. Bad evals are vague ("make it better"). Good evals are specific ("output must contain exactly 5 sections, each under 200 words, with at least one code example").
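The specific criteria quoted above are mechanically checkable. A minimal sketch of such an eval, assuming the skill emits markdown and that sections are delimited by markdown headings:

```python
import re

def evaluate_skill_output(text: str) -> dict:
    """Score a markdown output against concrete, measurable criteria:
    exactly 5 sections, each under 200 words, at least one code example."""
    # Split on markdown heading lines; the pieces between them are section bodies
    pieces = re.split(r"(?m)^#{1,6} .*$", text)
    bodies = [p for p in pieces if p.strip()]
    checks = {
        "five_sections": len(bodies) == 5,
        "sections_under_200_words": all(len(b.split()) < 200 for b in bodies),
        "has_code_example": "```" in text,
    }
    # Overall score: fraction of criteria met
    checks["score"] = sum(checks.values()) / 3
    return checks
```

Because each criterion is a boolean check rather than a vibe, the optimization agent gets an unambiguous signal about exactly which requirement the current prompt fails to satisfy.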
The agent runs the skill, evaluates the output against your criteria, identifies where it falls short, modifies the skill prompt to address the gaps, and repeats, keeping each change that scores measurably better than the last.
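The loop itself amounts to simple hill-climbing over the prompt text. A minimal sketch, where `run_skill` and `improve` stand in for calls to the LLM agent and are assumptions, not a real API:

```python
def optimize_skill(prompt, run_skill, evaluate, improve, iterations=5):
    """Hill-climbing loop over a skill prompt.

    run_skill: executes the skill with a given prompt, returns its output
    evaluate:  scores an output (higher is better)
    improve:   agent call that rewrites the prompt given its current score
    """
    best_score = evaluate(run_skill(prompt))
    for _ in range(iterations):
        candidate = improve(prompt, best_score)   # agent proposes a new prompt
        score = evaluate(run_skill(candidate))    # re-run the skill, re-score
        if score > best_score:                    # keep only improvements
            prompt, best_score = candidate, score
    return prompt, best_score
```

The accept-only-if-better check is what lets the loop run unattended overnight: a bad rewrite is simply discarded rather than compounding.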
Autoresearch transforms skill development from a manual debugging process into an automated optimization loop. By defining clear evaluation criteria and letting an agent iteratively improve the skill prompt, you can achieve significantly higher reliability without spending your own time on fixes.