I use Claude Code regularly and have been responsible for introducing colleagues...

paulhodge · 2025-08-08T15:48:33 1754668113

Lots of signs point to a conclusion that the Opus and Sonnet models are fundamentally better at coding, tool usage, and general problem solving across long contexts. There is some kind of secret sauce in the way they train the models. Dario has mentioned in interviews that this strength is one of the company's closely guarded secrets.

And I don't think we have a great eval benchmark that exactly measures this capability yet. SWE Bench seems to be pretty good, but there are already a lot of anecdotal comments that Claude is still better at coding than GPT 5, despite having similar scores on SWE Bench.

CuriouslyC · 2025-08-09T13:22:14 1754745734

I've been testing AI as a beta reader for >100k novels, and I can tell you with 100% certainty that Claude gets confused about things across long contexts much sooner than either O3/GPT5 or Gemini 2.5. In my experience Gemini 2.5 and O3/GPT5 run neck and neck until around 80-100k tokens, then Gemini 2.5 starts to pull ahead and by 150k tokens it's absolutely dominant. Claude is respectable but clearly in third place.

https://fiction.live/stories/Fiction-liveBench-Mar-25-2025/o... https://longbench2.github.io/

itsafarqueue · 2025-08-10T15:18:43 1754839123

Really useful comment thanks. Reminder that LLMs aren’t just for coding.

libraryofbabel · 2025-08-08T18:06:43 1754676403

Yeah, agree that the benchmarks don't really seem to reflect the community consensus. I wonder if part of it is the better symbiosis between the agent (Claude Code) and the Opus and Sonnet models it uses, which supposedly are fine-tuned on Claude Code tool calls? But agree, there is probably some additional secret sauce in the training, perhaps to do with RL on multi-step problems...

pcwelder · 2025-08-09T05:45:03 1754718303

I get similar accuracy to Claude Code using the Claude desktop app with a file+bash MCP (different tools, same performance).

My guess for why GPT5 scores higher on benchmarks is that they evaluate on well-defined tasks with all instructions given at the start.

Real life is multi-turn, with multiple sets of prompts to adhere to. This is where Claude is likely better.

CamouflagedKiwi · 2025-08-08T15:38:05 1754667485

Not a power user, but most recently I tried it out against Gemini and Claude produced something that compiled and almost worked - it was off in some specifics that I could easily tweak. The next thing I asked it (with slightly more detailed prompting) it more or less just nailed.

Meanwhile Gemini got itself stuck in a loop of compile/fail/try to fix/compile/fail again. Eventually it just gave up and said "I'm not able to figure this out". It does seem to have a kind of self-esteem problem in these scenarios, whereas Claude is more bullish on itself (maybe not always a good thing).

Claude seems to be the best at getting something that actually works. I do think Gemini will end up being tough competition, if nothing else because of the price, but Google really need a bit of a quality push on it. A free AI agent is worthless if it can't solve anything for me.

itsafarqueue · 2025-08-10T15:23:17 1754839397

The doom loop Gemini gets into is genuinely unpleasant to read.

“I’m so stupid. I should be ashamed of myself. I’m such a loser. Idiot, idiot. Oh god I suck. I’m an embarrassment.”

The torture Google must RL on this model, man.

aosaigh · 2025-08-08T15:30:58 1754667058

I mentioned this in another comment, but for me one of the big positives has nothing to do with the model: it’s the UI and how it presents itself.

I hated at first that it wasn’t like Cursor, sitting in the IDE. Then I realised I was using Cursor completely differently, often for small tasks where it’s only moderately helpful (refactoring, adding small functions, autocompleting).

With Claude I have to stop, think and plan before engaging with it, meaning it delivers much more impactful changes.

Put another way, it demands more from me, meaning I treat it with more respect and get more out of it.

libraryofbabel · 2025-08-08T15:37:50 1754667470

This is a good point, the CLI kind of forces you to engage with the coding process through the eyes of the agent, rather than just treating it as “advanced autocomplete” in the IDE.

However, there are a lot of Claude Code clones out there now that are basically the same (Gemini CLI, Codex, now Cursor CLI etc.). Claude still seems to lead the pack, I think? Perhaps it’s some combination of better coding performance due to the underlying LLM (usually Sonnet 4) being fine-tuned on the agent tool calls, plus Claude is just a little more mature in terms of configuration options etc.?

enobrev · 2025-08-08T15:44:32 1754667872

I haven't tried codex or cursor-cli yet, but I have tried to give gemini a few tasks and in my experience, compared to claude code, it's not great.

Gemini's been very quick to dive in and start changing things, even when I don't want it to. But those changes almost always fall short of what I'm after. They don't run or they leave failing tests, and when I ask it to fix the tests or the underlying issue, it churns without success. Claude is significantly slower and definitely not right all the time, but it seems to do a better job of stepping through a problem and resolving it well enough, while also improving results when I interject.

conception · 2025-08-09T04:50:05 1754715005

CC is great but I prefer Roo, as I find it much easier to keep an eye on Claude’s work and guide (or cancel) it as it goes. You also have greater control over modes and which models you use, but you miss out on hooks and the secret sauce Anthropic has in it. Roo also has more bugs.

CuriouslyC · 2025-08-09T13:13:33 1754745213

Claude the model is good but not amazing, O3/GPT5/Gemini 2.5 are better in most ways IMO. The Claude model does seem to have been trained on tool use and agentic behavior more than other models though, so even though the raw benchmarks are worse, it's more performant when used for agentic tasks, at least in terms of not getting confused and making a mess.

The big thing with Claude Code seems to be the agentic process they've baked into it.

dnh44 · 2025-08-09T13:18:12 1754745492

Codex CLI has got better at this, although I don't think it's better than Claude Code yet.

derencius · 2025-08-09T16:15:42 1754756142

gemini cli is good. ampcode is very good and precise with changes.

but codex cli is very annoying to use. hopefully it will get usable.

cesarvarela · 2025-08-09T04:26:38 1754713598

Among other things, the amount of usage you get for the price.