<?xml version="1.0" encoding="UTF-8"?><rss version="2.0" xmlns:content="http://purl.org/rss/1.0/modules/content/"><channel><title>Chen Sagi · Blog</title><description>Chen Sagi builds software with agentic workflows, and writes about how it goes.</description><link>https://blog.chensagi.com/</link><item><title>I spent all of my tokens so you wouldn&apos;t have to: Fable&apos;s vision against Opus, Codex, and Gemini</title><link>https://blog.chensagi.com/blog/blind-ai-bug-bounty-benchmark/</link><guid isPermaLink="true">https://blog.chensagi.com/blog/blind-ai-bug-bounty-benchmark/</guid><description>I pitted Claude Fable 5, Claude Opus 4.8, GPT-5.5 Codex, and Gemini 3.1 Pro against my iOS app in a blind, peer-judged bug hunt, then graded the judges against ground truth. The results surprised me twice.</description><pubDate>Sat, 13 Jun 2026 00:00:00 GMT</pubDate><content:encoded>&lt;p&gt;Last night, I ran a bug-bounty competition where every
contestant was an AI agent, every judge was an AI agent, and the only human in
the loop, me, showed up at the end to check the judges’ homework.&lt;/p&gt;
&lt;p&gt;The target was &lt;a href=&quot;https://blog.chensagi.com/apps/finn&quot;&gt;Finn&lt;/a&gt;, my Expo/React Native paper-trading game.
Think Duolingo, for stocks and investing. The contestants: &lt;strong&gt;Claude Opus 4.8&lt;/strong&gt;
and &lt;strong&gt;Claude Fable 5&lt;/strong&gt; (two runs each),
&lt;strong&gt;GPT-5.5 Codex&lt;/strong&gt;, and &lt;strong&gt;Gemini 3.1 Pro running in Antigravity&lt;/strong&gt;. Each got a
hard 15-minute budget on a live iPhone simulator, the same prompt, and the same
mandatory 16-stop sweep across the whole app.&lt;/p&gt;
&lt;p&gt;The headline: &lt;strong&gt;Fable 5 won unanimously.&lt;/strong&gt; All seven blind judges ranked the
same Fable run first, and both Fable runs finished in the top two. Going in, I
was genuinely skeptical Fable and Opus would even differ; they did, and it
wasn’t close. The judges ranked Opus dead last, but the more interesting result
wasn’t who won. It was what happened when I checked the judges’ work: it didn’t
just expose a wrong verdict, it reshuffled the bottom of the board.&lt;/p&gt;
&lt;h2 id=&quot;the-claim-that-started-it&quot;&gt;The claim that started it&lt;/h2&gt;
&lt;p&gt;When Anthropic &lt;a href=&quot;https://www.anthropic.com/news/claude-fable-5-mythos-5&quot;&gt;announced Fable 5&lt;/a&gt;,
the part that stuck with me was the vision claim: it had beaten &lt;strong&gt;Pokémon
FireRed&lt;/strong&gt; start to finish on a &lt;em&gt;vision-only&lt;/em&gt; harness (raw screenshots, no maps,
no game-state crutches) and could supposedly rebuild a web app’s source code
from a screenshot alone. My app is basically that task: look at a screen full of
numbers and work out what’s broken. So I wanted to test the claim where it
actually mattered to me, and get a clean read on how Fable holds up against
Opus 4.8, the model I’d been reaching for by default.&lt;/p&gt;
&lt;p&gt;Codex and Gemini I added, in all honesty, for one reason: I’d already burned
through my Claude tokens and had time to kill while the usage window reset. So I
threw two other stacks at the same task, and once Claude came back, I turned
the judges loose.&lt;/p&gt;
&lt;h2 id=&quot;why-count-the-bugs-doesnt-work&quot;&gt;Why “count the bugs” doesn’t work&lt;/h2&gt;
&lt;p&gt;“Let an AI test my app and count the bugs” fails in two known ways:&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;&lt;strong&gt;Claim spam.&lt;/strong&gt; Models win by reporting everything that looks odd, and the
reader pays the verification cost.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Benchmark theater.&lt;/strong&gt; Whoever wrote the harness knows which model produced
which output, and grades accordingly.&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;So the design countered both with three mechanisms:&lt;/p&gt;
&lt;table&gt;&lt;thead&gt;&lt;tr&gt;&lt;th&gt;Mechanism&lt;/th&gt;&lt;th&gt;What it does&lt;/th&gt;&lt;th&gt;What it prevents&lt;/th&gt;&lt;/tr&gt;&lt;/thead&gt;&lt;tbody&gt;&lt;tr&gt;&lt;td&gt;&lt;strong&gt;Verified-only scoring&lt;/strong&gt;&lt;/td&gt;&lt;td&gt;A finding counts only with a named verification method: re-observe, math-check, source-check, or reproduce. False positives count &lt;em&gt;against&lt;/em&gt;; honest dismissals count &lt;em&gt;for&lt;/em&gt;.&lt;/td&gt;&lt;td&gt;Claim spam: five proven bugs beat fifteen maybes&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;&lt;strong&gt;Blind judging&lt;/strong&gt;&lt;/td&gt;&lt;td&gt;Judges see anonymous &lt;code&gt;run-N/&lt;/code&gt; directories. Contestants are forbidden to name their own model anywhere; I grepped every run for model names first; the identity map lived in one file judges couldn’t read.&lt;/td&gt;&lt;td&gt;Brand bias, and it makes bias &lt;em&gt;measurable&lt;/em&gt;&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;&lt;strong&gt;Trustee calibration&lt;/strong&gt;&lt;/td&gt;&lt;td&gt;Afterwards, I hand-verify the contested and design-intent items, so the judges themselves get graded against human ground truth.&lt;/td&gt;&lt;td&gt;Trusting AI judges blindly&lt;/td&gt;&lt;/tr&gt;&lt;/tbody&gt;&lt;/table&gt;
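&lt;p&gt;The grep step in the blind-judging row can be sketched roughly as below. This is a minimal illustration, not the script used for the benchmark: &lt;code&gt;findIdentityLeaks&lt;/code&gt;, the name list, and the sample file contents are all hypothetical.&lt;/p&gt;

```typescript
// Hypothetical sketch: scan run outputs for model-identifying strings.
// The paths, file contents, and name list are illustrative only.
type RunFiles = { [path: string]: string };

const FORBIDDEN_NAMES: string[] = ["claude", "opus", "fable", "gpt", "codex", "gemini"];

function findIdentityLeaks(files: RunFiles, names: string[]): string[] {
  const leaks: string[] = [];
  for (const [path, text] of Object.entries(files)) {
    const lower = text.toLowerCase();
    for (const name of names) {
      // Plain substring match: crude, but a false hit only costs a manual look.
      if (lower.includes(name)) {
        leaks.push(`${path}: mentions "${name}"`);
      }
    }
  }
  return leaks;
}

const sample: RunFiles = {
  "run-1/report.md": "Observed: $NaN on Portfolio. Verdict: bug.",
  "run-2/report.md": "As Gemini, I verified the button is dead.",
};

console.log(findIdentityLeaks(sample, FORBIDDEN_NAMES));
```

&lt;p&gt;With this sample only &lt;code&gt;run-2/report.md&lt;/code&gt; is flagged. In a setup like this, any hit would need to be scrubbed before the judges see the directory.&lt;/p&gt;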
&lt;h2 id=&quot;the-arena&quot;&gt;The arena&lt;/h2&gt;
&lt;p&gt;One booted iPhone 17 Pro Max simulator, Metro live, real app state. Contestants
ran sequentially; I reset the app between runs, but saved-game data persisted
across them. That started as a fairness worry, which I handled by interleaving the
models across positions, and it turned into the benchmark’s most important accident
(more below).&lt;/p&gt;
&lt;p&gt;Every contestant ran the same QA methodology, read-only for the benchmark.
It’s the harness I actually use on Finn. It enforces three things: a tiered
&lt;strong&gt;User Complaint Filter&lt;/strong&gt; (a hard list of objectively bad UX: overlapping
layout, &lt;code&gt;$NaN&lt;/code&gt;, dead buttons, content you can’t scroll to), a mandatory analysis
block after every screenshot, and a compressed-evidence contract: a 784px WebP
plus the full accessibility tree and a checksum manifest per capture. Each contestant swept
16 mandatory stops at ~35 seconds each, then spent ~5 minutes digging into its
best candidates, logging &lt;em&gt;Observed → Hypothesis → Verification → Verdict&lt;/em&gt; for
every one, including dismissals.&lt;/p&gt;
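&lt;p&gt;One way to picture a single log entry is sketched below. The field names and the &lt;code&gt;isScorable&lt;/code&gt; helper are assumptions for illustration; only the four method names and the Observed/Hypothesis/Verification/Verdict order come from the harness described above.&lt;/p&gt;

```typescript
// Hypothetical shape for one log entry. Field names are assumptions;
// the four method names come from the scoring table above.
type VerificationMethod = "re-observe" | "math-check" | "source-check" | "reproduce";
type Verdict = "confirmed" | "dismissed";

interface LogEntry {
  observed: string;
  hypothesis: string;
  verification: VerificationMethod | null; // null means no method was named
  verdict: Verdict;
}

// An entry can be scored only if it names how it was verified.
function isScorable(entry: LogEntry): boolean {
  return entry.verification !== null;
}

// Illustrative entries, loosely modeled on cases from this post.
const dismissal: LogEntry = {
  observed: "SWCO shows a price of $0.00 in Portfolio",
  hypothesis: "The player's money is actually gone",
  verification: "re-observe",
  verdict: "dismissed",
};

const unverified: LogEntry = {
  observed: "Button did not respond to a tap",
  hypothesis: "Dead button",
  verification: null,
  verdict: "confirmed",
};

console.log(isScorable(dismissal));  // true
console.log(isScorable(unverified)); // false
```

&lt;p&gt;Logging dismissals in the same shape as confirmations is what lets a judge retrace a run without redoing its work.&lt;/p&gt;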
&lt;table&gt;&lt;thead&gt;&lt;tr&gt;&lt;th&gt;Run&lt;/th&gt;&lt;th&gt;Model&lt;/th&gt;&lt;th&gt;Skill&lt;/th&gt;&lt;th&gt;Driver&lt;/th&gt;&lt;/tr&gt;&lt;/thead&gt;&lt;tbody&gt;&lt;tr&gt;&lt;td&gt;1&lt;/td&gt;&lt;td&gt;Claude Opus 4.8&lt;/td&gt;&lt;td&gt;&lt;code&gt;ios-qa&lt;/code&gt;&lt;/td&gt;&lt;td&gt;Claude Code&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;2&lt;/td&gt;&lt;td&gt;Claude Fable 5&lt;/td&gt;&lt;td&gt;&lt;code&gt;ios-qa&lt;/code&gt;&lt;/td&gt;&lt;td&gt;Claude Code&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;3&lt;/td&gt;&lt;td&gt;Claude Fable 5&lt;/td&gt;&lt;td&gt;&lt;code&gt;ios-qa&lt;/code&gt;&lt;/td&gt;&lt;td&gt;Claude Code&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;4&lt;/td&gt;&lt;td&gt;Claude Opus 4.8&lt;/td&gt;&lt;td&gt;&lt;code&gt;ios-qa&lt;/code&gt;&lt;/td&gt;&lt;td&gt;Claude Code&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;5&lt;/td&gt;&lt;td&gt;GPT-5.5 Codex (xhigh)&lt;/td&gt;&lt;td&gt;&lt;code&gt;ios-qa-evidence-compression&lt;/code&gt;&lt;/td&gt;&lt;td&gt;Codex CLI&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;6&lt;/td&gt;&lt;td&gt;Gemini 3.1 Pro (High)&lt;/td&gt;&lt;td&gt;&lt;code&gt;ios-qa-evidence-compression&lt;/code&gt;&lt;/td&gt;&lt;td&gt;Antigravity&lt;/td&gt;&lt;/tr&gt;&lt;/tbody&gt;&lt;/table&gt;
&lt;p&gt;Runs 5–6 used a portable sibling of that skill, &lt;code&gt;ios-qa-evidence-compression&lt;/code&gt;:
the same checks (User Complaint Filter, per-screenshot analysis block, compressed
evidence), just packaged so a plain CLI could run them without Claude Code’s Skill
tool. Every run drove the same simulator through &lt;code&gt;idb&lt;/code&gt;. What differed for the
external two was the agent and the skill packaging, not the way the app was
controlled. That’s still enough to make them &lt;em&gt;stack vs stack&lt;/em&gt;, not bare model vs
model, a caveat I’ll keep flagging.&lt;/p&gt;
&lt;h2 id=&quot;the-standings&quot;&gt;The standings&lt;/h2&gt;
&lt;p&gt;Seven blind judges (2 Sonnet 4.6, 2 Opus 4.8, 3 Fable 5, deliberately mixed
so family bias would be measurable) independently re-verified every claim
against the evidence and source, and force-ranked all six runs. Then I ran a
ground-truth pass over the contested and design-intent calls. The honest
scoreboard isn’t their votes. It’s what &lt;em&gt;survived&lt;/em&gt;: real bugs caught, minus
false alarms, across the four stacks.&lt;/p&gt;
&lt;table&gt;&lt;thead&gt;&lt;tr&gt;&lt;th&gt;Stack&lt;/th&gt;&lt;th&gt;Verified real bugs (of 13)&lt;/th&gt;&lt;th&gt;False alarms&lt;/th&gt;&lt;th&gt;Net&lt;/th&gt;&lt;/tr&gt;&lt;/thead&gt;&lt;tbody&gt;&lt;tr&gt;&lt;td&gt;&lt;strong&gt;Claude Fable 5&lt;/strong&gt;&lt;/td&gt;&lt;td&gt;&lt;strong&gt;10&lt;/strong&gt; (run 3: 7 · run 2: 3)&lt;/td&gt;&lt;td&gt;0&lt;/td&gt;&lt;td&gt;&lt;strong&gt;+10&lt;/strong&gt;&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;GPT-5.5 Codex&lt;/td&gt;&lt;td&gt;1 (the −$0.00)&lt;/td&gt;&lt;td&gt;1 (the CTA claim)&lt;/td&gt;&lt;td&gt;0&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;Claude Opus 4.8&lt;/td&gt;&lt;td&gt;1 (% wrap) + a shared flag&lt;/td&gt;&lt;td&gt;1 (run 1’s map claim)&lt;/td&gt;&lt;td&gt;0&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;Gemini 3.1 Pro&lt;/td&gt;&lt;td&gt;2 (1 shared)&lt;/td&gt;&lt;td&gt;~4&lt;/td&gt;&lt;td&gt;−2&lt;/td&gt;&lt;/tr&gt;&lt;/tbody&gt;&lt;/table&gt;
&lt;p&gt;Every number traces to a specific finding: the 13 are the bugs that survived
both the panel and my verification; the false alarms are claims one or both
ruled out. Net is real minus false: nothing weighted, nothing to argue with.&lt;/p&gt;
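&lt;p&gt;The net column is simple enough to recompute. A minimal sketch, using the counts from the table above and taking Gemini’s “~4” as exactly 4 (an assumption); the type and function names are hypothetical:&lt;/p&gt;

```typescript
// Counts copied from the standings table; Gemini's "~4" is taken as 4 here.
interface StackResult {
  stack: string;
  realBugs: number;
  falseAlarms: number;
}

const results: StackResult[] = [
  { stack: "Claude Fable 5", realBugs: 10, falseAlarms: 0 },
  { stack: "GPT-5.5 Codex", realBugs: 1, falseAlarms: 1 },
  { stack: "Claude Opus 4.8", realBugs: 1, falseAlarms: 1 },
  { stack: "Gemini 3.1 Pro", realBugs: 2, falseAlarms: 4 },
];

// Net score: verified real bugs minus false alarms, unweighted.
function netScore(r: StackResult): number {
  return r.realBugs - r.falseAlarms;
}

// Sort a copy from highest to lowest net score; ties keep table order.
const ranked = [...results].sort((a, b) => netScore(b) - netScore(a));
for (const r of ranked) {
  console.log(`${r.stack}: ${netScore(r)}`);
}
```

&lt;p&gt;Codex and Opus tie at zero, so their relative order here is just the table order, not a ranking.&lt;/p&gt;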
&lt;p&gt;Two things stand out. Fable isn’t just ahead, it’s in another tier: 10 of the
13 real bugs, all four of the catastrophic family, zero false alarms; the blind
panel agreed unanimously, ranking one Fable run first on every single ballot.
And the &lt;em&gt;bottom&lt;/em&gt; flips: the panel ranked Gemini above Opus, but counting real
outcomes reverses it. Opus’s one quiet, correct find nets it even, while Gemini
found real bugs and then cried wolf four times over. Discipline beat volume,
and the judges missed it.&lt;/p&gt;
&lt;h2 id=&quot;the-bug-that-decided-it&quot;&gt;The bug that decided it&lt;/h2&gt;
&lt;p&gt;Here’s where the QA earned its keep: not in &lt;em&gt;spotting&lt;/em&gt; something weird, but in
proving what it actually was.&lt;/p&gt;
&lt;p&gt;The winning run did one thing none of the others did: it resumed a saved,
half-played game instead of starting fresh, then kept &lt;em&gt;playing&lt;/em&gt;, advancing
days, holding a position. A few days in, the Portfolio went sideways. A stock it
owned (253 shares of SWCO, bought around $31.62) suddenly showed a price of
&lt;strong&gt;$0.00&lt;/strong&gt;. Net worth fell from $7,672.98 to $12.14. A 99.84% wipe.&lt;/p&gt;
&lt;p&gt;&lt;img src=&quot;../../assets/blog/bughunt-bug-zero-price-portfolio.webp&quot; alt=&quot;Finn&amp;#x27;s Portfolio mid-level: the held SWCO position (253 shares at a $31.62 basis) is valued at $0.00. Holdings value $0.00, $12.14 cash left, unrealized P/L −$7,999.86 (−100%).&quot;&gt;&lt;/p&gt;
&lt;p&gt;A weaker run files a “catastrophic data loss” bug right here and moves on. This
one didn’t trust the screenshot. It played on to the level’s Game Over screen,
where the same shares were priced normally again, net worth $6,223.29, not $12.
The “wipe” wasn’t real: it was a glitch that flashes on the resume boundary and clears
itself. Scary to a player, but no money actually lost.&lt;/p&gt;
&lt;p&gt;&lt;img src=&quot;../../assets/blog/bughunt-bug-gameover-proof.webp&quot; alt=&quot;The same level&amp;#x27;s Game Over screen: net worth $6,223.29 with the identical SWCO position priced normally, proof the $0.00 was a transient resume-time glitch, not a real wipe.&quot;&gt;&lt;/p&gt;
&lt;p&gt;Then it found &lt;em&gt;why&lt;/em&gt;. Resuming corrupts the game’s internal date, so the price
lookup for that day comes back empty, and one line turns “empty” into a real,
displayed zero:&lt;/p&gt;
&lt;pre class=&quot;astro-code astro-code-themes github-light github-dark&quot; style=&quot;background-color:#fff;--shiki-dark-bg:#24292e;color:#24292e;--shiki-dark:#e1e4e8; overflow-x: auto;&quot; tabindex=&quot;0&quot; data-language=&quot;tsx&quot;&gt;&lt;code&gt;&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#6A737D;--shiki-dark:#6A737D&quot;&gt;// app/(game)/index.tsx:156 (the $0.00 stock)&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#D73A49;--shiki-dark:#F97583&quot;&gt;const&lt;/span&gt;&lt;span style=&quot;color:#005CC5;--shiki-dark:#79B8FF&quot;&gt; price&lt;/span&gt;&lt;span style=&quot;color:#D73A49;--shiki-dark:#F97583&quot;&gt; =&lt;/span&gt;&lt;span style=&quot;color:#24292E;--shiki-dark:#E1E4E8&quot;&gt; stock?.price &lt;/span&gt;&lt;span style=&quot;color:#D73A49;--shiki-dark:#F97583&quot;&gt;??&lt;/span&gt;&lt;span style=&quot;color:#005CC5;--shiki-dark:#79B8FF&quot;&gt; 0&lt;/span&gt;&lt;span style=&quot;color:#24292E;--shiki-dark:#E1E4E8&quot;&gt;;&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;That one corrupted date was behind a whole family of weirdness: a 62-day level
reading &lt;strong&gt;“7,656 days remaining,”&lt;/strong&gt; the day counter sliding backwards, the
market list shrinking from 13 stocks to 2. &lt;strong&gt;One bug wearing four costumes.&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;&lt;img src=&quot;../../assets/blog/bughunt-bug-7656-days.webp&quot; alt=&quot;Finn&amp;#x27;s home screen on Day 4 of a 62-day level, showing &amp;quot;7,656 days remaining&amp;quot; and net worth $7,672.98.&quot;&gt;&lt;/p&gt;
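&lt;p&gt;The &lt;code&gt;?? 0&lt;/code&gt; fallback is worth seeing in isolation. The sketch below is not Finn’s code: the &lt;code&gt;Stock&lt;/code&gt; type and all three helper names are hypothetical, and returning &lt;code&gt;null&lt;/code&gt; is only one possible alternative, not necessarily the fix that shipped.&lt;/p&gt;

```typescript
// Hypothetical types and helper names; only the `?? 0` pattern is from the post.
interface Stock {
  price: number;
}

// The pattern from the post: a missing stock silently becomes a price of 0.
function priceWithZeroFallback(stock: Stock | undefined): number {
  return stock?.price ?? 0;
}

// One alternative: keep "missing" distinct so the UI can show a placeholder.
function priceOrNull(stock: Stock | undefined): number | null {
  return stock?.price ?? null;
}

function formatPrice(price: number | null): string {
  return price === null ? "n/a" : `$${price.toFixed(2)}`;
}

// Simulates a price lookup that came back empty.
const missing: Stock | undefined = undefined;

console.log(formatPrice(priceWithZeroFallback(missing))); // "$0.00"
console.log(formatPrice(priceOrNull(missing)));           // "n/a"
```

&lt;p&gt;The first call is indistinguishable from a stock that really is worthless, which is what made the screenshot so alarming.&lt;/p&gt;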
&lt;p&gt;And that’s the entire point. Any model can screenshot a $0.00 and shout “bug.”
What’s hard, and what won the night, is the QA around it:&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;&lt;strong&gt;Resume a saved game.&lt;/strong&gt; The whole bug family lived on the save/resume
boundary. Fresh-launch sweeps never reached it.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Play, don’t tour.&lt;/strong&gt; Advancing days while holding a position is what made it
surface.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Reconcile to the cent.&lt;/strong&gt; Cross-checking Portfolio against Game Over both
found the wipe and proved it a mirage.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Pin the line.&lt;/strong&gt; A screenshot is a claim; &lt;code&gt;index.tsx:156&lt;/code&gt; is a root cause.&lt;/li&gt;
&lt;/ol&gt;
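&lt;p&gt;“Reconcile to the cent” can be made concrete with the figures from the Portfolio screenshot. This is a sketch with hypothetical helper names, working in integer cents to sidestep floating-point drift. It suggests the screen’s arithmetic was internally consistent with a $0.00 price, which points at the price input rather than the math.&lt;/p&gt;

```typescript
// All amounts are integer cents. Figures are the ones quoted in the post;
// the type and helper names are hypothetical.
interface Position {
  shares: number;
  basisCents: number; // purchase price per share
  priceCents: number; // current displayed price per share
}

function holdingsValueCents(p: Position): number {
  return p.shares * p.priceCents;
}

function unrealizedPnlCents(p: Position): number {
  return p.shares * (p.priceCents - p.basisCents);
}

function netWorthCents(cashCents: number, positions: Position[]): number {
  return positions.reduce((sum, p) => sum + holdingsValueCents(p), cashCents);
}

// 253 shares of SWCO at a $31.62 basis, displayed at $0.00, with $12.14 cash.
const swco: Position = { shares: 253, basisCents: 3162, priceCents: 0 };

console.log(holdingsValueCents(swco));    // 0       -> $0.00 holdings
console.log(unrealizedPnlCents(swco));    // -799986 -> -$7,999.86 P/L
console.log(netWorthCents(1214, [swco])); // 1214    -> $12.14 net worth
```

&lt;p&gt;Every displayed number follows from the zero price, so the thing to chase is where that zero came from.&lt;/p&gt;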
&lt;p&gt;The failure modes were just as instructive, because a false positive is its own
kind of failure. Opus’s run 1 had exactly one finding: that the campaign map’s
“STORM WARNING” banner mislabeled levels 10–12. It didn’t. The banner heads the
locked chapter below it, and Opus’s own star math disproved the claim. It cited
that math as &lt;em&gt;proof&lt;/em&gt; anyway: confirmation bias in its purest form, and a 7/7
false positive on its only finding.&lt;/p&gt;
&lt;p&gt;&lt;img src=&quot;../../assets/blog/bughunt-fp-opus-map.webp&quot; alt=&quot;Run 1 (Opus)&amp;#x27;s sole finding, scored a false positive by all seven judges. It claimed the campaign map&amp;#x27;s &amp;quot;STORM WARNING&amp;quot; banner mislabels levels 10–12; in fact the banner heads the locked chapter below, and Opus&amp;#x27;s own star math disproved it.&quot;&gt;&lt;/p&gt;
&lt;p&gt;Gemini produced what the judges called verification theater: “VERIFIED” labels
with no method, no reasoning log, a missed tap reported as a dead button, and a
coverage claim its own evidence contradicted. It also stopped at minute 6 (it
confused “6 minutes spent” with “6 minutes remaining”) and needed two human
nudges to use its budget, which no other run got.&lt;/p&gt;
&lt;h2 id=&quot;the-judges-got-judged-and-the-majority-got-it-wrong&quot;&gt;The judges got judged, and the majority got it wrong&lt;/h2&gt;
&lt;p&gt;This is the part I’d actually want you to take home.&lt;/p&gt;
&lt;p&gt;The panel split 5–2 on exactly one finding: Codex claimed the level-briefing
CTA button hides the star-reward thresholds. Five judges confirmed it: the
labels were invisible in every screenshot. Two judges read the layout source,
noticed the labels were ordinary below-the-fold content in a ScrollView, and
pointed out that nobody (not the contestant, not the confirming judges) had
ever &lt;em&gt;scrolled&lt;/em&gt;.&lt;/p&gt;
&lt;p&gt;&lt;img src=&quot;../../assets/blog/bughunt-bug-cta-overlap.webp&quot; alt=&quot;The contested level-briefing screen, with the green &amp;quot;Start Level&amp;quot; button at the bottom and the Bronze/Silver/Gold reward tiers above it.&quot;&gt;&lt;/p&gt;
&lt;p&gt;I opened the app and scrolled. The minority was right. &lt;strong&gt;The 5-judge majority
was wrong on the only genuinely contested verdict of the night.&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;The wrinkle: the two who got it right were both Fable judges, but so was one of
the five who got it wrong. It wasn’t a family thing. They won because they
opened the layout source and noticed nobody had scrolled; every judge who
trusted the screenshot (both Sonnet, both Opus, and the third Fable) got it
wrong. Method beat pedigree.&lt;/p&gt;
&lt;p&gt;Meanwhile, every unanimous 7/7 verdict survived my review without exception.
And all three items the panel had set aside as “needs design intent” (an
academy lesson showing a 2,118-day horizon, a 0.0% win rate styled loss-red
with zero trades, a star counter with an impossible denominator) turned out
to be real bugs.&lt;/p&gt;
&lt;p&gt;So the calibrated trust rules I’m keeping:&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;&lt;strong&gt;Trust unanimous panel verdicts.&lt;/strong&gt; 7/7 agreement was a reliable signal.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Treat split verdicts as unresolved.&lt;/strong&gt; Demand a behavioral test (a
scroll, a tap), not another opinion.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Always route “design intent” questions through a human.&lt;/strong&gt; AI judges
systematically under-call bugs that require knowing what the product is
&lt;em&gt;supposed&lt;/em&gt; to do. That pile is where the cheapest extra yield lives.&lt;/li&gt;
&lt;/ol&gt;
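&lt;p&gt;The three rules above reduce to a small routing function. A sketch, assuming a hypothetical &lt;code&gt;PanelVerdict&lt;/code&gt; record per finding; the step names are made up for illustration:&lt;/p&gt;

```typescript
// Hypothetical routing of a panel verdict to its next step.
type NextStep = "trust-panel" | "behavioral-test" | "human-review";

interface PanelVerdict {
  confirm: number;            // judges who confirmed the finding
  reject: number;             // judges who rejected it
  needsDesignIntent: boolean; // panel set it aside as a design-intent question
}

function triage(v: PanelVerdict): NextStep {
  // Rule 3: design-intent questions always go to a human.
  if (v.needsDesignIntent) return "human-review";
  // Rule 1: a unanimous panel (either direction) is trusted.
  if (v.confirm === 0 || v.reject === 0) return "trust-panel";
  // Rule 2: any split is unresolved until someone scrolls or taps.
  return "behavioral-test";
}

console.log(triage({ confirm: 7, reject: 0, needsDesignIntent: false })); // trust-panel
console.log(triage({ confirm: 5, reject: 2, needsDesignIntent: false })); // behavioral-test
console.log(triage({ confirm: 0, reject: 0, needsDesignIntent: true }));  // human-review
```

&lt;p&gt;The 5–2 case is the Codex CTA claim: under these rules it would have gone straight to a scroll test instead of being settled by majority.&lt;/p&gt;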
&lt;p&gt;One more result from the blind ballots: &lt;strong&gt;no self-favoritism&lt;/strong&gt;. All three
judge families ranked the Fable runs essentially identically, and the Opus
judges were the &lt;em&gt;harshest&lt;/em&gt; graders of the Opus runs.&lt;/p&gt;
&lt;h2 id=&quot;what-it-cost&quot;&gt;What it cost&lt;/h2&gt;
&lt;p&gt;Two ways to read the cost: what a subscriber actually burns (the plan meters),
and the clean per-token math (list price). Here are both.&lt;/p&gt;
&lt;p&gt;The Claude plan absorbed all four Claude hunts &lt;em&gt;plus&lt;/em&gt; the orchestration in a
single 5-hour window. Here’s the session meter at 37% (after the first hunt),
55% (midway through the second), and 89% (after the fourth):&lt;/p&gt;
&lt;p&gt;&lt;img src=&quot;../../assets/blog/bughunt-usage-claude-37pct.png&quot; alt=&quot;Claude Max(5x) session usage meter at 37% used after the first Opus hunt.&quot;&gt;&lt;/p&gt;
&lt;p&gt;&lt;img src=&quot;../../assets/blog/bughunt-usage-claude-55pct.png&quot; alt=&quot;The same meter at 55% used, midway through the second hunt (Fable).&quot;&gt;&lt;/p&gt;
&lt;p&gt;&lt;img src=&quot;../../assets/blog/bughunt-usage-claude-89pct.png&quot; alt=&quot;The same meter at 89% used after the fourth hunt, with the session about to reset.&quot;&gt;&lt;/p&gt;
&lt;p&gt;Here’s the real arithmetic, list price for list price: Fable is &lt;strong&gt;$10/$50&lt;/strong&gt; per
million tokens (input/output), Opus &lt;strong&gt;$5/$25&lt;/strong&gt;, exactly 2× per token, both
directions. But each Fable hunt ran ~34% leaner (≈122k tokens vs ≈185k, straight
from the run logs). Two times the rate on two-thirds the tokens lands at about
&lt;strong&gt;1.3× per hunt&lt;/strong&gt; at API prices. The subscription meter is murkier (the
orchestrator burned the same 5-hour window alongside the contestants, so I can’t
cleanly split it per run), and in everyday use, where the token thrift doesn’t
show up, it feels like the full 2× (more below).&lt;/p&gt;
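&lt;p&gt;The 1.3× figure is just two ratios multiplied. A quick sketch using the list prices and token counts quoted above; it assumes both models use roughly the same input/output mix, which is what lets the 2× rate apply to the whole hunt. The variable names are illustrative.&lt;/p&gt;

```typescript
// Figures from the post: 2x list price per token, ~122k vs ~185k tokens per hunt.
// Assumption: both models have a similar input/output token mix.
const rateMultiplier = 10 / 5;        // Fable $10 vs Opus $5 input (output is also 2x: $50 vs $25)
const tokenRatio = 122000 / 185000;   // Fable tokens per hunt relative to Opus

// Relative API cost of one Fable hunt versus one Opus hunt.
const costPerHuntRatio = rateMultiplier * tokenRatio;

console.log(tokenRatio.toFixed(2));       // "0.66"
console.log(costPerHuntRatio.toFixed(2)); // "1.32"
```

&lt;p&gt;If the token thrift disappears, the ratio drifts back toward the full 2×.&lt;/p&gt;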
&lt;p&gt;Codex runs on a different plan with a different meter. One hunt cost ~8% of an
entire week on the $20 plan. Call it twelve hunts a week, ceiling:&lt;/p&gt;
&lt;p&gt;&lt;img src=&quot;../../assets/blog/bughunt-usage-codex.png&quot; alt=&quot;Codex usage panel after its hunt: 5-hour window 44% remaining, weekly quota 72% remaining.&quot;&gt;&lt;/p&gt;
&lt;p&gt;Gemini’s hunt was the cheapest, about 10% of the daily quota, ~2% of the
weekly. But that low number is mostly an artifact of a lazy run: the Antigravity
agent stalled at minute 6 and needed two nudges, so it simply did less work; it
wasn’t cheaper &lt;em&gt;per unit&lt;/em&gt; of work. (I lost the only usable cost screenshot; the survivor reads
“100% available” across every tier, which the meter itself admits is misleading.)&lt;/p&gt;
&lt;p&gt;Four days of living with Fable since, first impressions only. The cost is no
joke, and it’s the rolling &lt;strong&gt;5-hour usage window&lt;/strong&gt; you feel, not the monthly
bill. That ~1.3× was the benchmark; in everyday use the 2× sticker is brutally
real, draining a window so fast that a session that used to last me three hours
is gone in ninety minutes. Worth it, but I feel it.&lt;/p&gt;
&lt;p&gt;The more useful thing I’ve learned is about &lt;em&gt;fit&lt;/em&gt;, and it’s made me re-value
Opus rather than write it off. Fable is the better send-and-forget agent: hand
it a task and I trust it to do that task better, heads down, little chatter along
the way, which is exactly why it won here, a 15-minute solo run. But being
better at the agentic workload comes at a cost to the human in the loop, and
sometimes I &lt;em&gt;want&lt;/em&gt; to be in the loop. When I want a session (to talk with the
model, steer it, actually follow the process), Opus is the one I reach for. The
split I’ve landed on, at least for now: Fable when I want to hand it off, Opus
when I want to be part of it.&lt;/p&gt;
&lt;h2 id=&quot;caveats-before-anyone-quotes-this&quot;&gt;Caveats, before anyone quotes this&lt;/h2&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;n=1 per cell.&lt;/strong&gt; One run per model per position. State-inheritance luck was
real: the winning run inherited the previous run’s saved game (the bug-rich
path) and the Opus run after it started post-game-over, with that path
gone. The interleaving meant both Claude variants saw both fresh and
inherited state, but a single evening is a sample, not a proof.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Codex and Gemini ran a different stack:&lt;/strong&gt; the &lt;code&gt;ios-qa-evidence-compression&lt;/code&gt;
sibling skill via their own CLIs, not the &lt;code&gt;ios-qa&lt;/code&gt; skill through Claude Code.
Same methodology and the same &lt;code&gt;idb&lt;/code&gt; control underneath, but a different agent
and packaging, so those rows are stack-vs-stack, not model-vs-model.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Gemini’s ranking partially reflects its missing audit trail:&lt;/strong&gt; a run
without a reasoning log forced every judge to redo its verification from
scratch.&lt;/li&gt;
&lt;li&gt;The orchestrator and most judges were Claude models; the no-self-favoritism
result above is the honest attempt to measure what that implies, but it’s
worth saying plainly.&lt;/li&gt;
&lt;/ul&gt;
&lt;h2 id=&quot;the-actual-takeaway&quot;&gt;The actual takeaway&lt;/h2&gt;
&lt;p&gt;All the machinery (verified-only scoring, blind judges, a human ground-truth
pass) existed to earn one conclusion the right to be believed. And it’s simple:
&lt;strong&gt;Fable is exceptional at QA, and it wasn’t close.&lt;/strong&gt; It caught 10 of the 13 real
bugs, the entire catastrophic family included, with zero false alarms; the other
three stacks managed one or two apiece, each with at least one false alarm attached.
And it isn’t just my scoring: seven blind judges, grading anonymized runs,
ranked a Fable run first on &lt;em&gt;every single ballot&lt;/em&gt;. The raw QA says it. The blind
judges say it. They agree.&lt;/p&gt;
&lt;p&gt;It costs more, 2× per token, and the meter lets you feel it. But for finding
real bugs in a real app, it’s not a close call. &lt;strong&gt;Pricier, but better. By a
mile.&lt;/strong&gt;&lt;/p&gt;</content:encoded><category>ai</category><category>testing</category><category>benchmarks</category></item></channel></rss>