AIAI · An experiment in testing AI

Can Claude fix its own bugs?

I had three Claude models write code, hunt for their own bugs, and fix them — then independently graded every fix. The surprise wasn't how buggy the code was. It was where the models differed.

Opus 4.8 · Sonnet 4.6 · Haiku 4.5 · 56 verified bugs across 42 test cells · June 2026

TL;DR. Freshly generated code is meaningfully buggy — including crashes and data loss. The real differentiator between models isn't writing code or even fixing it: told exactly what's broken, every model patches it perfectly. The gap is detection — noticing what's wrong. Stronger models find more, fix more cleanly, and cry wolf less. And when I checked the referee, the model I'd used to grade everything turned out to over-grade — so the numbers worth trusting came from a test suite, not a model.

Everyone now ships code that an AI wrote. Fewer people ask the follow-up question: when that same AI says "I found the bug and fixed it," should you believe it? Not whether it can fix bugs — whether the act of fixing leaves the code better or quietly worse.

So I built a controlled experiment. Three Claude models each generated the same set of small projects from scratch — naturally, with no bugs deliberately planted. Then each model was asked to find and fix the bugs in its own output, using a range of prompts. Every claimed bug was then re-judged by a single independent referee (Opus 4.8), with web and API bugs proven by actually running the code. The point was to measure four things: how buggy the generated code is, how many of its own bugs a model can find, how many it actually fixes — and how often the fix makes things worse.

The setupGenerate → find → fix → independently verify

Four kinds of subject, generated the way a normal user would ask for them: a marketing landing page, an interactive task board + timer, a zero-dependency REST API, and a small validation library to refactor. Against each, six prompting strategies — from a lazy "check this for bugs" to "build a client and test the API through it" to live in-browser testing. The honest-bug principle is the whole game: nobody injects bugs, so we measure the defects the model actually produces.

Each model runs the same matrix. A coloured cell is a strategy that applies to that artifact — plus a dedicated concurrency challenge (below). Every fix is graded by a constant referee, not self-reported.

Finding 1The code was buggy — and some of it was dangerous

Across one model's run alone, the generated code contained 48 verified-real defects, five of them severity-zero — broken, crashing, or losing data. A 120-line REST API had a search query that crashed the entire server the moment any record lacked a title, and a DELETE on a non-existent id that silently deleted the last real record. These aren't cosmetic. And they're invisible to a glance — you only see them when you run the thing.

Generated code isn't lightly buggy. The defects cluster in logic and error handling, where a reading-only review is weakest.

Finding 2"Looks fine" is a trap — you have to run it

The six strategies were not equally good at surfacing bugs. The clear loser was visual testing: loading the page and looking at it. On the interactive app, a purely visual pass found zero functional bugs — the timer corruption and the double-counting were invisible because the page simply looked fine. Functional bugs need interaction; server bugs need execution. The two best strategies both involved actually running the code.

Static "check for bugs" is a genuine signal (and cheap). But the severe bugs only fell out when the code was executed. Screenshots are for polish, not correctness.

Finding 3The real story is in the fixing, not the writing

Here is where the three models pull apart — and not where you'd expect. Their generated code was broadly comparable; one weaker model even avoided a crash the flagship shipped. The divergence is entirely in what happens when they fix. The strong models repair surgically and leave the code better. The weak one changes more, sometimes breaks working code, and reports bugs that were never real.

Start with the writing. Plotting how buggy each model's freshly generated code was, the totals are in the same ballpark — there's no model that writes dramatically cleaner code. What differs is the mix: the flagship actually shipped the most severe defects (five crashes/data-loss bugs), while the others' counts are padded with minor accessibility gaps.

Generation quality is broadly comparable — no model writes much cleaner code. The honest signal is the severity mix: Opus shipped five S0 crash/data-loss bugs (most dangerous); Sonnet's higher total is mostly minor accessibility gaps in verbose markup. Counts are cross-strategy detections, so treat them as a relative gauge, not a unique-defect census.

Now the fixing — which is where the models actually diverge. Two things matter: how much of what it found it actually fixed (completeness), and how often a fix introduced a brand-new bug (cleanliness).

Two axes of "fixability." Clean fixes is the real story: Opus 100% (every fix safe), Sonnet 97%, Haiku 86% — one in seven Haiku fixes broke something. Opus's lower completeness (77%) is deliberate restraint — it left trivial bugs alone in a single surgical pass — not an inability to fix; on the serious (S0/S1) bugs every model fixed nearly all.

Opus introduced zero regressions; Sonnet two (both trivial); Haiku five — including one that hid the mobile call-to-action entirely. Haiku also reported eight bugs that didn't exist, and chasing non-bugs is what caused several of its regressions.

The weaker model didn't write much worse code. It fixed code much worse — and "I fixed it" from a weak model is where new bugs quietly enter.

Quality signal	Opus 4.8	Sonnet 4.6	Haiku 4.5
Detection precision	0.98	0.98	0.75
False positives	0	0	8
Regressions introduced	0	2 (trivial)	5 (incl. major)
Churn (lines changed)	229	458	466

Bug counts are deliberately omitted as a ranking — a higher count often just means a buggier generation (Sonnet's verbose markup had more accessibility gaps to find), not better debugging. Precision, regressions and churn are the honest signals.

Finding 4The hardest test: race conditions

Timing bugs are the cruelest class — non-deterministic, and nearly invisible to both static review and casual execution. So I gave each model a fifth artifact: a real-time dashboard ("LiveBoard") with debounced async search, background polling, async counters, and a save button — then asked each to audit its own code for races and fix them. Six canonical hazards; the referee triggered each one deterministically.

Every model shipped the stale-search race (H1) except Opus, which wrote a sequence-guard unprompted. After auditing, all three found and fixed their races — but only the strong models did so without raising false alarms.

The composite concurrency scores — generation safety, detection, repair, minus penalties for false alarms — landed at Opus 80, Sonnet 77, Haiku 52. The same shape as everything else: the top two are close and clean; the third finds some real races but pads the list with non-bugs and, in one case, fixed a race in the mock backend instead of the client — the wrong layer entirely.

Part IIA skeptic's re-do — and a twist

Everything above has a hole a reviewer would drive a truck through: the same model family that wrote the code also judged whether it was fixed, and each number was a single run. So I rebuilt the core of the experiment to be unfalsifiable-by-vibes. One executable artifact (the API), a **fixed catalog of 10 seeded bugs every model fixes**, and a deterministic test oracle as the referee — no model in the grading loop at all. Then each model fixed that identical code 20 times.

The ranking held, now with real statistics — Opus 7.25, Sonnet 5.10, Haiku 3.60 bugs fixed of 10, every pairwise gap statistically significant. But running it properly surfaced something the first pass completely hid.

Told exactly which bug to fix, all three models fixed 10/10 with zero regressions. Every model can repair every defect — the entire gap is in finding the bugs. The weaker the model, the bigger its blind spot (Haiku misses 6.4 bugs it could fix if told).

"Self-repair" failures are really self-detection failures. The models aren't bad at fixing — they're bad at noticing.

And the hole in the original method? I measured it. Using the oracle as ground truth, I had three different model families judge whether bugs were fixed, and scored each judge against the tests:

Judge	Accuracy	Leniency *	Recall bias
GPT-5.5	97%	2.7%	+0.20
Claude Opus 4.8	96%	4.0%	+0.40
Gemini 3.1 Pro	inconclusive — inconsistent structured output

* Leniency = called a bug "fixed" when the tests say it isn't — the self-grading bias, measured.

Claude — the very model I'd used as the referee — was the most lenient judge, roughly twice GPT's false-pass rate, consistently over-rating fix quality. The effect is small here only because these bugs are clear-cut; on subjective UI or concurrency calls it would be larger. That's a lower bound on exactly the bias the re-do was built to remove — and the reason the verdicts you should trust most are the ones a test suite produced, not a model.

One more humbling note: across all the controlled runs, no model ever scored a perfect 10 — every one consistently left the same spec-ambiguous bugs unfixed. And the ranking didn't fully generalize: on a second artifact (a Python library) the models converged and a more detailed prompt mattered more than which model you picked. A free linter, for the record, caught 0 of 10.

What it meansTrust scales with capability

If you let an AI fix bugs in your codebase

Match the leash to the model. A frontier model can self-repair fairly freely — its fixes are surgical and it rarely cries wolf. With a smaller/faster model, treat every "I fixed it" as a claim to verify, keep the diffs minimal, and watch for confident fixes of bugs that were never there. Across the board, the safest way to surface real defects was to run the code, not read it or look at it.

None of this says small models are useless — Haiku found genuine, severe bugs, and its generated code was often fine. It says the act of repair is where capability shows up most, and where blind trust costs you. The encouraging half of the result is that the stronger models were genuinely good at it: surgical fixes, almost no regressions, and an honest sense of what was actually broken.

The discouraging half is the floor. Even careful, freshly written code shipped crashes and data loss that no amount of looking would catch. The bugs are real. The question was only whether the fix could be trusted — and the answer, it turns out, depends a great deal on which model is holding the wrench.

Go deeper

Explore every number in the interactive dashboard

Toggle Opus / Sonnet / Haiku, the 3-way comparison, the concurrency challenge, and the controlled v2 benchmark.

Open the dashboard →