I do think Claude Code as a tool gave Anthropic some advantages over others. They have plan mode, todo lists, askUserQuestion tools, hooks, etc., which greatly extend Opus's capabilities. I agree that others (Codex, Cursor) quickly copy these features, but that's the nature of the race, and Anthropic has to keep innovating to maintain its edge.
The biggest advantage by far is the data they collect along the way. Data that can be bucketed to real devs, and the signals extracted from it, can be top tier. All that data and those signals, plus whatever else they cook up, can be added back into the training corpus and the models retrained / version++ on the new set. Rinse and repeat.
(this is also why all the labs, including some chinese ones, are subsidising / metoo-ing coding agents)
(I work at Cursor) We have all these! Plan mode with a GUI + the ability to edit plans inline. Todos. A tool for asking the user questions, which the model calls automatically or you can invoke manually. Hooks. And you can use Opus or any other model with these.
I did something similar in Python, in case people want to see a slightly different perspective (I was aiming for a minimal agent library with built-in tools, similar to the Claude Agent SDK):
For one, these models should be able to understand the physical world via images, audio, and video. I do agree that current models are quite good at coding, but that's mainly because coding is entirely text-based and easily verifiable. It's not obvious that this capability will transfer to other domains that aren't text-based and aren't as easily verifiable.
I'm not familiar with these open-source models. My bias is that they're heavily benchmaxxing and not really helpful in practice. Can someone with a lot of experience using these, as well as Claude Opus 4.5 or Codex 5.2 models, confirm whether they're actually on the same level? Or are they not that useful in practice?
P.S. I realize Qwen3-Max-Thinking isn't actually an open-weight model (only accessible via API), but I'm still curious how it compares.
I don't know where your impression about benchmaxxing comes from. Why would you assume closed models are not benchmaxxing? Being closed and commercial, they have more incentive to fake it than the open models.
You are not familiar, yet you claim a bias. Bias based on what? I've used pretty much only open-source models for the last two years. I occasionally give OpenAI and Anthropic a try to see how good they are, but I stopped supporting them when they started calling for regulation of open models. I haven't seen folks get ahead of me with closed models; I'm keeping up just fine with these free open models.
Yeah, I get that there's nuance between all of them. I ranked MiniMax higher for its agentic capabilities; in my own usage, MiniMax's tool calling is stronger than DeepSeek's and GLM's.
My observation is that vibe-coded applications are significantly lower quality than traditional software. Anthropic software (which they claim to be 90% vibe coded) is extremely buggy, especially the UI.
That's a misunderstanding based on a loose definition of "vibe coding". When companies threw around the "90% of code is written by AI" claims, they were referring to counting characters of autocomplete based on users actually typing code (most of which was equivalent to the "AI generated" code Eclipse's tab-completion produced a decade ago), and sometimes to writing hyperlocal prompts for a single method.
We can identify 3 levels of "vibe coding":
1. GenAI Autocomplete
2. Hyperlocal prompting about a specific function. (Copilot's original pitch)
3. Developing the app without looking at code.
Level 3 is hardly considered "vibe" coding, and Level 2 is iffy.
"90% of code written by AI" in some non-trivial contexts only very recently reached level 3.
I don't think it ever reached Level 2, because that's just a painfully tedious way of writing code.
They have not said that. They've only said that most of their code is written by Claude, which is different from "vibe coding". If competent engineers review the code, then it is little different from any other coding.
IIRC, the Claude Code creator mentioned that all the PRs are reviewed by humans, just like normal human PRs. So yes, humans still look at the code at the review stage. Though I still consider this to be level 3, but anyway, this is just a matter of definition.
I mostly work at level 2, and I call it "power coding", like power armor, or power tools. Your will and your hand still guides the process continuously. But now your force is greatly multiplied.
Over the weekend, I wrote this small Python library to teach myself the core idea behind modern agentic systems. This kind of software sits at the core of Claude Code, Codex, etc. I wanted to see if I could build it from scratch, so this is mostly educational for me.
The result is a surprisingly simple piece of software. At its core are immutable DAGs, which keep the design simple and easy to reason about.
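To make "immutable DAG" concrete, here's a rough sketch of the idea (the names below are made up for illustration, not the library's actual API): every step is a frozen node, and appending a step returns a new node pointing at its parent, so existing history is never mutated and branches share their ancestors.

```python
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass(frozen=True)
class Step:
    role: str                        # "user", "assistant", or "tool"
    content: str
    parent: Optional["Step"] = None  # edge back to the previous step


def append(node: Step, role: str, content: str) -> Step:
    # Returns a brand-new node; `node` and everything before it stay untouched.
    return Step(role=role, content=content, parent=node)


def history(node: Step) -> Tuple[Step, ...]:
    # Walk parent links to recover the linear path through the DAG.
    steps = []
    while node is not None:
        steps.append(node)
        node = node.parent
    return tuple(reversed(steps))


# Branching is free: two continuations of the same prompt share their ancestors.
root = Step("user", "Summarize README.md")
a = append(root, "assistant", "Reading the file...")
b = append(root, "assistant", "Which section do you care about?")
assert a.parent is b.parent is root
```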
I also added a set of built-in tools that are inspired by Claude Code's built-in tools.
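And here's roughly what the tool side looks like, again as a sketch rather than the library's real API (the function and registry names are made up): tools are plain Python callables in a registry, and the agent loop dispatches on whatever tool call the model returns.

```python
import json
import subprocess
from pathlib import Path


def read_file(path: str) -> str:
    """Return a file's contents (roughly what a Read-style tool does)."""
    return Path(path).read_text()


def run_bash(command: str) -> str:
    """Run a shell command and return its combined output (a Bash-style tool)."""
    result = subprocess.run(command, shell=True, capture_output=True, text=True)
    return result.stdout + result.stderr


TOOLS = {"read_file": read_file, "run_bash": run_bash}


def agent_loop(model, prompt: str, max_turns: int = 10) -> str:
    """Call the model, run any tool it requests, and feed the result back."""
    messages = [{"role": "user", "content": prompt}]
    for _ in range(max_turns):
        reply = model(messages)          # e.g. {"tool": "read_file", "args": {...}}
        if "tool" not in reply:          # no tool call means we have a final answer
            return reply["content"]
        output = TOOLS[reply["tool"]](**reply["args"])
        messages.append({"role": "tool",
                         "content": json.dumps({"tool": reply["tool"], "output": output})})
    return "Stopped: max turns reached."
```

Here `model` is any callable that takes the message list and returns either a tool call or a final answer; in the real thing that's an LLM API call.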
A bonus point: it can also capture Claude Code auth tokens, so you can use it with your Claude Code subscription. However, there is a chance that Anthropic will ban you if they detect this, so use it at your own risk.
P.S.: One additional point I want to mention is that Claude Code (SDK) is closed-source, so I can't modify it for my use case or fix its buggy UI on my own. This is one of the reasons I'm building this library.
More than 30% of the time you use Claude Code it "flickers"? That can't be right. I use Neovim and Codex side by side with tmux, and both flicker about 0% of the time. What is Claude Code doing that makes it flicker so much? Seems strange.
(It's worth reading the gh comment I linked if you're interested in terminals!)
tl;dr other programs like Neovim and Codex use the "alternate screen buffer" which means they don't use scrollback and reimplement their own scrolling. CC uses scrollback (because that's what most users expect) which it has to clear entirely and redraw everything when it changes (causing tearing/flickering). There's no way to incrementally update scrollback in a terminal.
(I also want to add some more flavor to the 1/3 metric because I don't want it to be misinterpreted. "30% of the time you use CC it flickers" isn't quite accurate - it depends on screen height and what you do. Most people will not see _any_ flickers at all. Some people with short screens (typically VSCode users, because the terminal opens fairly short by default) will see flickers. Previously, if something rendered offscreen, users would see a flicker for _every subsequent frame_ regardless of whether anything was actually changing. Now they will only see a flicker occasionally, when it's _absolutely_ needed. Once or twice vs thousands.
Additionally, the metric really tracks when CC emits a "clear scrollback" operation. If the user is in a terminal that supports DEC 2026 they won't see a flicker even if we emit that clear scrollback command.)
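For anyone curious, here's a tiny sketch of the sequences involved (these are standard xterm/DEC codes, not anything from CC's actual renderer): the alternate-screen pair is what Neovim/Codex use, ED 3 is the scrollback clear that causes the visible flicker, and DEC 2026 is what lets supporting terminals present the clear + redraw as one atomic frame.

```python
import sys

ENTER_ALT_SCREEN  = "\x1b[?1049h"  # separate buffer, no scrollback (Neovim/Codex approach)
LEAVE_ALT_SCREEN  = "\x1b[?1049l"
CLEAR_SCROLLBACK  = "\x1b[3J"      # ED 3: wipes scrollback; the source of the flicker
CLEAR_AND_HOME    = "\x1b[2J\x1b[H"
BEGIN_SYNC_UPDATE = "\x1b[?2026h"  # DEC 2026: terminal holds output until the end marker...
END_SYNC_UPDATE   = "\x1b[?2026l"  # ...so the clear + repaint lands as a single frame


def redraw_in_scrollback(frame: str) -> None:
    """Scrollback-based redraw: clear everything, repaint, wrapped in a sync update."""
    sys.stdout.write(BEGIN_SYNC_UPDATE)
    sys.stdout.write(CLEAR_SCROLLBACK + CLEAR_AND_HOME)
    sys.stdout.write(frame)
    sys.stdout.write(END_SYNC_UPDATE)
    sys.stdout.flush()
```

Terminals that don't support DEC 2026 just ignore those two sequences, which is where the occasional visible flicker comes from.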
There is absolutely a way to incrementally update scrollback in a terminal, 100% flicker-free. Whether it works in every terminal is a different question, but if you can accept that your code will only work in pretty much every modern terminal, this is absolutely doable. I doubt people are still using xterm and other older terminals for this, and for those cases you can fall back to the more compatible way.
I have a hypothesis: they haven't fixed this because they're using Claude Code to develop Claude Code. I'm a fan of Claude Code, but it isn't good enough to fix tricky issues like this. And because no one looks at the codebase themselves, they haven't been able to fix it after many months. Sometimes all you need is an engineer to sit down for a weekend and fix the damn bug, not to spin up 9 different Claude agents prompted to fix it.
Perhaps the engineer could sit down for 8 hours a day during the work week. The Silicon Valley obsession with having no life and working weekends is so endemic.
Interesting, I can see this being very similar to NVIDIA's CuTe DSL. This hints that we are converging on a (locally) optimal design for Python-based DSL kernel programming.
> Once the model is fully released, scientists will be able to adapt and fine-tune it on their own datasets to better tackle their unique research questions.
This is in the press release, so they are going to release the weights.