Lots of signs point to a conclusion that the Opus and Sonnet models are fundamen...

CuriouslyC · 2025-08-09T13:22:14 1754745734

I've been testing AI as a beta reader for >100k novels, and I can tell you with 100% certainty that Claude gets confused about things across long contexts much sooner than either o3/GPT-5 or Gemini 2.5. In my experience, Gemini 2.5 and o3/GPT-5 run neck and neck until around 80-100k tokens, then Gemini 2.5 starts to pull ahead, and by 150k tokens it's absolutely dominant. Claude is respectable but clearly in third place.

https://fiction.live/stories/Fiction-liveBench-Mar-25-2025/o... https://longbench2.github.io/

itsafarqueue · 2025-08-10T15:18:43 1754839123

Really useful comment, thanks. Reminder that LLMs aren’t just for coding.

libraryofbabel · 2025-08-08T18:06:43 1754676403

Yeah, agree that the benchmarks don't really seem to reflect the community consensus. I wonder if part of it is the better symbiosis between the agent (Claude Code) and the Opus and Sonnet models it uses, which supposedly are fine-tuned on Claude Code tool calls? But agree, there is probably some additional secret sauce in the training, perhaps to do with RL on multi-step problems...

pcwelder · 2025-08-09T05:45:03 1754718303

I get similar accuracy to Claude Code using the Claude desktop app with a file+bash MCP (different tools, same performance).

My guess for why GPT-5 scores higher on benchmarks is that they evaluate on well-defined tasks with all instructions given at the start.

Real life is multi-turn, with multiple sets of prompts to adhere to. This is where Claude is likely better.