
I'm not a researcher.

I acknowledge Mamba, RWKV, Hyena and the rest, but like I said, they fall under the LLM bucket. All of these architectures have had 7B+ models trained; that's not zero investment. They're not "winning" over transformers because they're not slam dunks, not because no one is investing in them. They bring improvements in some areas, but with drawbacks that make switching not a straightforward "this is better", and that is what you need in order to divert significant funds away from an industry-leading approach that is still working.

What happens when you throw away state information that's vital for a future query? Humans can just re-attend (re-read the book, re-watch the video, etc.), and transformers are always re-attending, but SSMs and RWKV? Too bad. A lossy state is a big deal when you cannot re-attend.
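To make the contrast concrete, here is a rough, hypothetical sketch (not any real model's code): a transformer-style step keeps every past token around and can attend back to all of it, while an SSM/RWKV-style step folds each token into a fixed-size state, so whatever that compression drops is unrecoverable.

    # Hypothetical illustration: full-history attention vs. fixed-size recurrent state.
    import numpy as np

    def transformer_step(kv_cache, token, query):
        """Append the new token, then attend over everything seen so far."""
        kv_cache.append(token)                       # history grows without bound
        keys = np.stack(kv_cache)                    # (t, d)
        scores = keys @ query / np.sqrt(query.size)  # any past token can be recalled
        weights = np.exp(scores - scores.max())
        weights /= weights.sum()
        return kv_cache, weights @ keys              # output mixes the full history

    def ssm_step(state, token, decay=0.9):
        """Fold the new token into a fixed-size state (lossy compression)."""
        state = decay * state + (1.0 - decay) * token
        return state, state                          # output depends only on the state

    d = 16
    kv_cache, state = [], np.zeros(d)
    for _ in range(1000):
        tok = np.random.randn(d)
        kv_cache, _ = transformer_step(kv_cache, tok, query=np.random.randn(d))
        state, _ = ssm_step(state, tok)

    # len(kv_cache) == 1000: every token is still addressable by a future query.
    # state.shape == (16,): the same 1000 tokens are squeezed into 16 numbers.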

Plus, some of those improvements are only theoretical. Improved inference-time batching and efficient attention (flash, windowed, hybrid, etc.) have let transformers stay performant relative to some of these alternatives, rendering even the speed advantage moot, or at least not worth switching over for. It's not enough to simply match transformers.

> Because I don't understand why you're arguing against that.

I'm not arguing anything. You asked why the disproportionate funding. Non-transformer LLMs aren't actually better than transformers, and non-LLM options are non-existent.



Fair enough, they fall under the LLM bucket, but I think most things can. Still, my point is that there's a very narrow exploration of techniques. Call it what you want; that's the problem.

And I'm not arguing there's zero investment, but it is incredibly disproportionate, and there's a big push to make it even more so. It's not about all or none; it's about the distribution of those "investments" (including government grants and academic funding).

With the other architectures, I think you're being too harsh. Don't let perfection get in the way of good enough. We're talking about research, and more specifically about what warrants more research. Where would transformers be today if we had made similar critiques?

Hell, we have a real-life example with diffusion models. Sohl-Dickstein's paper came out a year after Goodfellow's GAN paper, and yet it took 5 years for DDPM to come out. The reason is that at the time GANs performed better, so the vast majority of effort went there, at least 100x more if not 1000x, and the gap just widened. The difference between the two models really came down to scale and the parameterization of the diffusion process, which is something the Sohl-Dickstein paper mentions (specifically as something that should be studied further). Five years, really, because very few people were looking. Even at that time it was known that the potential of diffusion models was greater than that of GANs, but the concentration went to what worked better at that moment[0]. You can see a similar thing with ViTs if you go look up Cordonnier's paper. The time gap is smaller, but so is the innovation: ViT barely changes the architecture.

There are lots of problems with SSMs and other architectures. I'm not going to deny that (I already said as much above). The ask is to be given a chance to resolve those problems. An important part of that decision is understanding the theoretical limits of these different technologies. The question is "can these problems be overcome?" It's hard to answer, but so far the answer isn't "no". That's why I brought up diffusion and ViTs above. I could even bring in Normalizing Flows and Flow Matching, which are currently undergoing this change.

  > It's not enough to simply match transformers.
I think you're both right and wrong. And I think you agree, unless you're changing your previous argument.

Where I think you're right is that the new thing needs to show capabilities that the current thing can't. Then you have to provide evidence that its own limitations can be overcome in such a way that, overall, it is better. I don't say strictly better because there is no global optimum. I want to make this clear because there will always be limitations or flaws; perfection doesn't exist.

Where I think you're wrong is a matter of context. If you want the new thing to already match or beat SOTA transformer LLMs, then I'll refer you back to the self-fulfilling prophecy problem from my earlier comment. You never give anything a chance to become better because it isn't better from the get-go.

I know I've made that argument before, but let me put it a different way. Suppose you want to learn the guitar. Do you give up the first time you pick it up and find out you're terrible at it? No, that would be ridiculous! You keep at it because you know you have the capacity to do more. You continue because you see progress. The logic is exactly the same here. It would be idiotic of me to claim that because you can only play Mary Had a Little Lamb, you'll never be able to play a song people actually want to listen to, that you'll never amount to anything and should just give up playing now.

My argument here is: don't give up. Look how far you've come. Sure, you can only play Mary Had a Little Lamb, but not long ago you couldn't play a single chord. You couldn't even hold the guitar the right way up! Being bad at things is not a reason to give up on them; being bad at things is the first step to being good at them. The reason to give up on something is that it has no potential. Don't confuse lack of success with lack of potential.

  > I'm not arguing anything. You asked why the disproportionate funding.
I guess you don't realize it, but you are making an argument. You were trying to answer my question, right? That is an argument. I don't think we're "arguing" in the bitter or upset sense. I'm not upset with you, and I hope you aren't upset with me. We're learning from each other, right? And there's not a clear answer to my original question either[1].

But I'm making my case for why we should set aside a bit of what we currently have so that we get more in the future. It sounds scary, but we know that by sacrificing some of our food we can use it to grow even more food next year. I know it's about the future, but we can't completely sacrifice the future for the present. There needs to be balance. Research funding is just like crop planning: you have to plan with excess in mind. If you're lucky, you have a very good year; if you're unlucky, at least everyone doesn't starve. Given that we're living in those fruitful, lucky years, I think it's even more important to continue the trend. We have the opportunity to have so many more fruitful years ahead. This is how we avoid the crashes and cycles that tech so frequently goes through. It's all there written in history; all you have to do is ask what led to these fruitful times. You cannot ignore that a big part of it was that lower-level research.

[0] Some of this also has to do with the publish-or-perish paradigm, but that gets convoluted and is itself related to funding, because we similarly provide far more funding to what works now than to what has higher potential. This is logical, of course, but the complexity of the conversation is that it has to deal with the distribution.

[1] I should clarify: my original question was a bit rhetorical. You'll notice that after asking it I provided an argument that this is a poor strategy. That's my framing of the problem. I mean, I live in this world; I'm used to people making the case from the other side.



