Nick Saraev demonstrates how to combine Claude Code skills with Andrej Karpathy's autoresearch pattern to dramatically improve skill reliability. The core insight: instead of manually fixing skills when they fail, you set up an automated optimization loop that uses evals to measure quality and iteratively improves the skill prompt overnight.
Derived from Karpathy's autoresearch GitHub repo, this pattern uses three ingredients: the thing being optimized (your skill.md), evaluation criteria (how to measure if the output is good), and an optimization agent that iterates between running the skill and improving it based on eval results.
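The three ingredients can be sketched as a small data structure. This is an illustrative shape only, not code from the autoresearch repo; the names `skill_prompt`, `evaluate`, and `improve` are hypothetical:

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class OptimizationSetup:
    # Ingredient 1: the thing being optimized (e.g. the contents of skill.md)
    skill_prompt: str
    # Ingredient 2: the eval — scores an output, higher is better
    evaluate: Callable[[str], float]
    # Ingredient 3: the optimization agent — given the current prompt and
    # its score, proposes an improved prompt
    improve: Callable[[str, float], str]
```

Keeping the eval as a plain callable makes it easy to swap criteria without touching the loop itself.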
The eval is the most critical component. It defines what "good output" looks like in concrete, measurable terms. Bad evals are vague ("make it better"). Good evals are specific ("output must contain exactly 5 sections, each under 200 words, with at least one code example").
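The specific criteria quoted above are mechanically checkable. A minimal sketch of such an eval, assuming the skill emits markdown and that sections are delimited by markdown headings:

```python
import re

def evaluate_skill_output(text: str) -> dict:
    """Score a markdown output against concrete, measurable criteria:
    exactly 5 sections, each under 200 words, at least one code example."""
    # Split on markdown heading lines; the pieces between them are section bodies
    pieces = re.split(r"(?m)^#{1,6} .*$", text)
    bodies = [p for p in pieces if p.strip()]
    checks = {
        "five_sections": len(bodies) == 5,
        "sections_under_200_words": all(len(b.split()) < 200 for b in bodies),
        "has_code_example": "```" in text,
    }
    # Overall score: fraction of criteria met
    checks["score"] = sum(checks.values()) / 3
    return checks
```

Because each criterion is a boolean check rather than a vibe, the optimization agent gets an unambiguous signal about exactly which requirement the current prompt fails to satisfy.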
The agent runs the skill, evaluates the output against your criteria, identifies where it falls short, modifies the skill prompt to address the gaps, and repeats, keeping each change that scores measurably better than the last.
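The loop itself amounts to simple hill-climbing over the prompt text. A minimal sketch, where `run_skill` and `improve` stand in for calls to the LLM agent and are assumptions, not a real API:

```python
def optimize_skill(prompt, run_skill, evaluate, improve, iterations=5):
    """Hill-climbing loop over a skill prompt.

    run_skill: executes the skill with a given prompt, returns its output
    evaluate:  scores an output (higher is better)
    improve:   agent call that rewrites the prompt given its current score
    """
    best_score = evaluate(run_skill(prompt))
    for _ in range(iterations):
        candidate = improve(prompt, best_score)   # agent proposes a new prompt
        score = evaluate(run_skill(candidate))    # re-run the skill, re-score
        if score > best_score:                    # keep only improvements
            prompt, best_score = candidate, score
    return prompt, best_score
```

The accept-only-if-better check is what lets the loop run unattended overnight: a bad rewrite is simply discarded rather than compounding.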
Autoresearch transforms skill development from a manual debugging process into an automated optimization loop. By defining clear evaluation criteria and letting an agent iteratively improve the skill prompt, you can achieve significantly higher reliability without spending your own time on fixes.