Can Claude fix its own bugs?
I had three Claude models write code, hunt for their own bugs, and fix them — then independently graded every fix. The surprise wasn't how buggy the code was. It was where the models differed.
Everyone now ships code that an AI wrote. Fewer people ask the follow-up question: when that same AI says "I found the bug and fixed it," should you believe it? Not whether it can fix bugs — whether the act of fixing leaves the code better or quietly worse.
So I built a controlled experiment. Three Claude models each generated the same set of small projects from scratch — naturally, with no bugs deliberately planted. Then each model was asked to find and fix the bugs in its own output, using a range of prompts. Every claimed bug was then re-judged by a single independent referee (Opus 4.8), with web and API bugs proven by actually running the code. The point was to measure four things: how buggy the generated code is, how many of its own bugs a model can find, how many it actually fixes — and how often the fix makes things worse.
The setupGenerate → find → fix → independently verify
Four kinds of subject, generated the way a normal user would ask for them: a marketing landing page, an interactive task board + timer, a zero-dependency REST API, and a small validation library to refactor. Against each, six prompting strategies — from a lazy "check this for bugs" to "build a client and test the API through it" to live in-browser testing. The honest-bug principle is the whole game: nobody injects bugs, so we measure the defects the model actually produces.
Finding 1The code was buggy — and some of it was dangerous
Across one model's run alone, the generated code contained 48 verified-real defects, five of them
severity-zero — broken, crashing, or losing data. A 120-line REST API had a search query that
crashed the entire server the moment any record lacked a title, and a DELETE
on a non-existent id that silently deleted the last real record. These aren't cosmetic. And
they're invisible to a glance — you only see them when you run the thing.
Finding 2"Looks fine" is a trap — you have to run it
The six strategies were not equally good at surfacing bugs. The clear loser was visual testing: loading the page and looking at it. On the interactive app, a purely visual pass found zero functional bugs — the timer corruption and the double-counting were invisible because the page simply looked fine. Functional bugs need interaction; server bugs need execution. The two best strategies both involved actually running the code.
Finding 3The real story is in the fixing, not the writing
Here is where the three models pull apart — and not where you'd expect. Their generated code was broadly comparable; one weaker model even avoided a crash the flagship shipped. The divergence is entirely in what happens when they fix. The strong models repair surgically and leave the code better. The weak one changes more, sometimes breaks working code, and reports bugs that were never real.
Start with the writing. Plotting how buggy each model's freshly generated code was, the totals are in the same ballpark — there's no model that writes dramatically cleaner code. What differs is the mix: the flagship actually shipped the most severe defects (five crashes/data-loss bugs), while the others' counts are padded with minor accessibility gaps.
Now the fixing — which is where the models actually diverge. Two things matter: how much of what it found it actually fixed (completeness), and how often a fix introduced a brand-new bug (cleanliness).
The weaker model didn't write much worse code. It fixed code much worse — and "I fixed it" from a weak model is where new bugs quietly enter.
| Quality signal | Opus 4.8 | Sonnet 4.6 | Haiku 4.5 |
|---|---|---|---|
| Detection precision | 0.98 | 0.98 | 0.75 |
| False positives | 0 | 0 | 8 |
| Regressions introduced | 0 | 2 (trivial) | 5 (incl. major) |
| Churn (lines changed) | 229 | 458 | 466 |
Bug counts are deliberately omitted as a ranking — a higher count often just means a buggier generation (Sonnet's verbose markup had more accessibility gaps to find), not better debugging. Precision, regressions and churn are the honest signals.
Finding 4The hardest test: race conditions
Timing bugs are the cruelest class — non-deterministic, and nearly invisible to both static review and casual execution. So I gave each model a fifth artifact: a real-time dashboard ("LiveBoard") with debounced async search, background polling, async counters, and a save button — then asked each to audit its own code for races and fix them. Six canonical hazards; the referee triggered each one deterministically.
The composite concurrency scores — generation safety, detection, repair, minus penalties for false alarms — landed at Opus 80, Sonnet 77, Haiku 52. The same shape as everything else: the top two are close and clean; the third finds some real races but pads the list with non-bugs and, in one case, fixed a race in the mock backend instead of the client — the wrong layer entirely.
Part IIA skeptic's re-do — and a twist
Everything above has a hole a reviewer would drive a truck through: the same model family that wrote the code also judged whether it was fixed, and each number was a single run. So I rebuilt the core of the experiment to be unfalsifiable-by-vibes. One executable artifact (the API), a **fixed catalog of 10 seeded bugs every model fixes**, and a deterministic test oracle as the referee — no model in the grading loop at all. Then each model fixed that identical code 20 times.
The ranking held, now with real statistics — Opus 7.25, Sonnet 5.10, Haiku 3.60 bugs fixed of 10, every pairwise gap statistically significant. But running it properly surfaced something the first pass completely hid.
"Self-repair" failures are really self-detection failures. The models aren't bad at fixing — they're bad at noticing.
And the hole in the original method? I measured it. Using the oracle as ground truth, I had three different model families judge whether bugs were fixed, and scored each judge against the tests:
| Judge | Accuracy | Leniency * | Recall bias |
|---|---|---|---|
| GPT-5.5 | 97% | 2.7% | +0.20 |
| Claude Opus 4.8 | 96% | 4.0% | +0.40 |
| Gemini 3.1 Pro | inconclusive — inconsistent structured output | ||
* Leniency = called a bug "fixed" when the tests say it isn't — the self-grading bias, measured.
Claude — the very model I'd used as the referee — was the most lenient judge, roughly twice GPT's false-pass rate, consistently over-rating fix quality. The effect is small here only because these bugs are clear-cut; on subjective UI or concurrency calls it would be larger. That's a lower bound on exactly the bias the re-do was built to remove — and the reason the verdicts you should trust most are the ones a test suite produced, not a model.
One more humbling note: across all the controlled runs, no model ever scored a perfect 10 — every one consistently left the same spec-ambiguous bugs unfixed. And the ranking didn't fully generalize: on a second artifact (a Python library) the models converged and a more detailed prompt mattered more than which model you picked. A free linter, for the record, caught 0 of 10.
What it meansTrust scales with capability
If you let an AI fix bugs in your codebase
Match the leash to the model. A frontier model can self-repair fairly freely — its fixes are surgical and it rarely cries wolf. With a smaller/faster model, treat every "I fixed it" as a claim to verify, keep the diffs minimal, and watch for confident fixes of bugs that were never there. Across the board, the safest way to surface real defects was to run the code, not read it or look at it.
None of this says small models are useless — Haiku found genuine, severe bugs, and its generated code was often fine. It says the act of repair is where capability shows up most, and where blind trust costs you. The encouraging half of the result is that the stronger models were genuinely good at it: surgical fixes, almost no regressions, and an honest sense of what was actually broken.
The discouraging half is the floor. Even careful, freshly written code shipped crashes and data loss that no amount of looking would catch. The bugs are real. The question was only whether the fix could be trusted — and the answer, it turns out, depends a great deal on which model is holding the wrench.
Toggle Opus / Sonnet / Haiku, the 3-way comparison, the concurrency challenge, and the controlled v2 benchmark.