Well, LLMs are also remarkably good at generalizing. Look at the datasets: they don't literally train on every conceivable type of question a user might ask, yet the LLM can adapt, just as you can.
The actual challenge on the way to general intelligence is that LLMs struggle with certain types of questions even if you *do* train them on millions of examples of that type of question. Mostly questions that require complex logical reasoning, although consistent progress is being made in this direction.
> Well, LLMs are also remarkably good at generalizing. Look at the datasets: they don't literally train on every conceivable type of question a user might ask, yet the LLM can adapt, just as you can.
Proof needed.
I'm serious. We don't have the datasets, but we do know their sizes, and the sizes suggest an incredible amount of information.
Take an estimate of 100 tokens ~= 75 words[0]. What is a trillion tokens? Well, that's 750bn words. There are approximately 450 words on a page[1], so that's about 1.67bn pages! If we put those into 500-page books, that comes out to about 3.33 million books!
Llama 3 has a pretraining set of 15T tokens[2] (that's pretraining only, so even more data is added later). That comes to ~50m books. Then keep in mind that this data is filtered and deduplicated. Even assuming a high failure rate in deduplication, this is an unimaginable amount of information.
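For concreteness, here's the same back-of-envelope arithmetic as a tiny Python sketch. The conversion rates are the rough estimates assumed above, not measured values:

```python
# Back-of-envelope: how much text is N tokens?
# Assumed rates: ~75 words per 100 tokens, ~450 words per page, 500 pages per book.
WORDS_PER_TOKEN = 0.75
WORDS_PER_PAGE = 450
PAGES_PER_BOOK = 500

def tokens_to_books(tokens: float) -> float:
    words = tokens * WORDS_PER_TOKEN
    pages = words / WORDS_PER_PAGE
    return pages / PAGES_PER_BOOK

print(f"1T tokens  -> {tokens_to_books(1e12) / 1e6:.2f} million books")   # ~3.33
print(f"15T tokens -> {tokens_to_books(15e12) / 1e6:.1f} million books")  # ~50.0
```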
That's a very good point. I just speak from my experience of fine-tuning pre-trained models. At that stage, at least, they can memorize new knowledge that couldn't have been in the training data just by seeing it once during fine-tuning (one epoch), which seems magical. Most instruction-tuning datasets are also remarkably small (very roughly <100K samples). This is only possible if the model has internalized its knowledge quite deeply and generally, such that the new knowledge is a tiny gradient update on top of existing expectations.
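(To make the "one epoch" point concrete, here's a minimal sketch of what that kind of fine-tuning run looks like with Hugging Face Transformers. The model name and the instructions.jsonl file are placeholders, not what I actually used; the relevant part is just num_train_epochs=1.)

```python
# Minimal single-epoch fine-tuning sketch (Hugging Face Transformers).
# Placeholder model and data; the point is that each example is seen only once.
from datasets import load_dataset
from transformers import (AutoModelForCausalLM, AutoTokenizer,
                          DataCollatorForLanguageModeling, Trainer, TrainingArguments)

model_name = "meta-llama/Llama-3.2-1B"  # placeholder; any causal LM works
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)
if tokenizer.pad_token is None:
    tokenizer.pad_token = tokenizer.eos_token  # some tokenizers ship without a pad token

# Hypothetical instruction dataset: one JSON object per line with a "text" field.
dataset = load_dataset("json", data_files="instructions.jsonl", split="train")
dataset = dataset.map(lambda ex: tokenizer(ex["text"], truncation=True), batched=True)

args = TrainingArguments(
    output_dir="out",
    num_train_epochs=1,              # each example is seen exactly once
    per_device_train_batch_size=4,
    learning_rate=2e-5,
)

trainer = Trainer(
    model=model,
    args=args,
    train_dataset=dataset,
    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),
)
trainer.train()
```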
But yes, I see what you mean: they are dumping practically the whole internet into it, so it's not unreasonable to think that it has memorized a massive proportion of the common question types a user might come up with, such that minimal generalization is needed.
I'm curious, how do you know this? I'm not doubting, but is it falsifiable?
I'm also not going to claim that LLMs only perform recall. They fit functions in a continuous manner, even though the data is discrete, so they can do more than recall. The question is how much more.
Another important point is that out of distribution doesn't mean "not in the training data". The two are sometimes conflated, but if that were the definition, it would just describe a test set. OOD means not belonging to the same distribution, though that's a bit complicated, especially when dealing with high-dimensional data.
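To illustrate the distinction with a toy example (a 1-D Gaussian standing in for the training distribution, which is obviously a massive simplification of high-dimensional text data):

```python
# "Not in the training set" vs. "out of distribution", in one dimension.
# Training data comes from N(0, 1). A fresh draw from N(0, 1) is unseen but
# in-distribution; a draw from N(6, 1) is unseen AND out of distribution.
import numpy as np

rng = np.random.default_rng(0)
train = rng.normal(loc=0.0, scale=1.0, size=10_000)

held_out = rng.normal(loc=0.0, scale=1.0)   # unseen, same distribution
ood_point = rng.normal(loc=6.0, scale=1.0)  # unseen, different distribution

def z_score(x, data):
    return abs(x - data.mean()) / data.std()

print(f"held-out z-score: {z_score(held_out, train):.2f}")   # typically small
print(f"OOD z-score:      {z_score(ood_point, train):.2f}")  # around 6
```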
I agree. It is surprising the degree to which they seem to be able to generalise, though I'd say in my experience the generalisation is very much at the syntax level and doesn't really reflect an underlying 'understanding' of what's being represented by the text — just a very, very good model of what text that represents reality tends to look like.
The commenter below is right that the amount of data involved is ridiculously massive, so I don't think human intuition is well equipped to have a sense of how much these models have seen before.