This post was a wake-up call. Last year I spent a week learning Swift the "old-fashioned way," following Apple's tutorials, with not much to show for it when I was done. Even though I use AI for coding daily, since Xcode doesn't natively support it, I just decided not to use it.
Seeing how far you went in just a couple of days, I realize now how much I missed out.
As mentioned later in the article: it doesn't really matter if you get an addition mostly right; you either get it right or you don't. I still appreciate their effort though, because even after altering the grading system, there were still some emergent abilities.
Assume we have a child, and we test him regularly:
- Test 1: First he can just draw squiggles on the math test
- Test 2: Then he can do arithmetic correctly
- Test 3: He fails only on the last details of the algebraic calculation.
Now, even though he fails all the tests, any reasonable parent would see that he is improving nicely and will be able to work in his chosen field in a year or so.
Or, if we are talking about AI, we can treat the test as a threshold: the results are continuously trending upwards, and we can expect the curve to breach the threshold in the future.
That is, measuring improvement instead of pass/fail allows one to predict when we might be able to use the AI for something.
With AI you can do millions of tests. Some tests are easy by chance (e.g. "Please multiply this list of numbers by zero"). Some tests are answered correctly by chance alone, easy or hard.
When you actually do these millions of tests, I don't think it really matters what the exact success metric is - an AI which is 'closer to correct, but still wrong' on one test will still get more tests correct overall on the dataset of millions of tests.
Human beings do arithmetic problems wrong all the time, so I'm not sure "doing addition 100% right" is a measure of intelligence.
I'm not saying LLMs will achieve AGI (I don't know if they will, or whether we'll even know when they do). But somehow people seem to be judging AI's intelligence with this simple procedure:
1. Find a task that AI can't do perfectly.
2. Gotcha! AI isn't intelligent.
It just makes me question humans' intelligence if anything.
Arithmetic is extremely easy for a neural network to perform and learn perfectly. That LLMs fail to learn it, even though it is so easy, is strong evidence that LLMs have a very limited capability to learn logical structures that can't be represented as grammar.
> Human beings do arithmetic problems wrong all the time
Humans built cars and planes and massive ships before we had calculators, and that requires a massive number of calculations that all have to be correct. Humans aren't bad at getting calculations right, they are just a bit slow. Today humans are bad at it because we don't practice, not because we can't. LLMs can't do that today, and the difference between "can learn to" and "can't" is massive.
My intuition is that a significant challenge for LLMs' ability to do arithmetic has to do with tokenization. For instance, `1654+73225` as per the OpenAI tokenizer tool breaks down into `165•4•+•732•25`, meaning the LLM is incapable of considering digits individually; that is, "165" is a single "word," and its relationship to "4", and in fact to every other token representing a numerical value, has to be learned. It can't do simple carry operations (or other arithmetic abstractions humans have access to) in the vast majority of cases because its internal representation of text is not designed for this. Arithmetic is easy to do in base 10 or 2 or 16, but it's a whole lot harder in base ~100k where 99% of the "digits" are words like "cat" or "///////////".
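For anyone who wants to check this themselves, here's a minimal sketch using the tiktoken library (this assumes the cl100k_base encoding used by the OpenAI tokenizer tool; other encodings split numbers differently):

```python
# Inspect how a number expression gets split into tokens.
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")
tokens = enc.encode("1654+73225")
pieces = [enc.decode_single_token_bytes(t).decode("utf-8") for t in tokens]
print(pieces)  # e.g. ['165', '4', '+', '732', '25'], digit groups rather than digits
```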
Compare that to understanding arbitrary base64-encoded strings; that's much harder for humans to do without tools. Tokenization still isn't _the_ greatest fit for it, but it's a lot more tractable, and LLMs can do it no problem. Even understanding ASCII art is impressive, given they have no innate idea of what any letter looks like, and they "see" fragments of each letter on each line.
So I'm not sure if I agree or disagree with you here. I'd say LLMs in fact have very impressive capabilities to learn logical structures. Whether grammar is the problem isn't clear to me, but their internal representation format obviously and enormously influences how much harder seemingly trivial tasks become. Perhaps some efforts in hand-tuning vocabularies could improve performance in some tasks, perhaps something different altogether is necessary, but I don't think it's an impossible hurdle to overcome.
I don't think that's really how it works - sure this is true at the first level in a neural network, but in deep neural networks after the first few layers the LLM shouldn't be 'thinking' in tokens anymore.
The tokens are just the input - the internal representation can be totally different (and that format isn't tokens).
Please don't act like you "know how it works" when you obviously don't.
The issue is not the fact that the model "thinks or doesn't think in tokens". The model is forced at the final sampling/decoding step to convert its latent representation back into tokens, one token at a time.
The models are fully capable of understanding the premise that they should "output a 5-7-5 syllable Haiku", but from the perspective of a model trying to count its own syllables, this is not possible, as its own vocabulary is tokenized in such a way that not only does the model not have direct phonetic information within the dataset, but it literally has no analogue for how humans count syllables (counting jaw drops). Models can't reason about the number of characters, or even tokens, used in a reply for the same exact reason.
The person you're replying to broadly is right, and you are broadly wrong. The internal format does not matter when the final decoding step forces a return of tokenization. Please actually use these systems rather than pontificating about them online.
That requires converting from a weird, unhelpful form into a more helpful form first, so yes, but the tokenisation makes things harder as it adds an extra step: they need to learn how these things relate while having significant amounts of the structure hidden from them.
This conversion is inherent in the problem of language and maths though - Two, too (misspelt), 2, duo, dos, $0.02, and one apple next to another apple, 0b10 and 二 can all represent the (fairly abstract) concept of two.
The conversion to a helpful form is required anyway (also, let's remember that computers don't work in base 10, and there isn't really a reason to believe that base 10 is inherently great for LLMs either)
* replace {}{}{} with addition, {}{} is subtraction unless followed by three spaces in which case it's also addition
* translate and correct any misspellings
* [512354] look up in your tables
* _ is 15
* dotted lines indicate repeated numbers
Technically they're doing the same thing. One we would assume is harder to learn the fundamental concepts from.
Right, which is why testing arithmetic is a good way to test how well LLMs generalize their capabilities to non-text tasks. LLMs could in theory be excellent at it, but they aren't, due to how they are trained.
The tokens are the structure over which the attention mechanism is permutation equivariant. This structure permeates the forward pass; it's important at every layer and will be until we find something better than attention.
> Arithmetic is extremely easy for a neural network to perform and learn perfectly
That'd depend on the design of the neural net and training objective.
It's certainly not something that comes naturally to an LLM which neither has numbers as inputs or outputs, nor is trained with an arithmetic objective.
Consider inputting "12345 * 10" into GPT-4. The first thing it is going to do is tokenize the input, then embed those tokens, and these embedding vectors are then the starting point of what the transformer has to work with...
You can use OpenAI's tokenizer tool (above) to see how it represents the "12345 * 10" character sequence as tokens, and the answer is that it breaks it down into the token ID sequence [4513, 1774, 353, 220, 605]. The [4513, 1774] represents the character sequence "12345", and "605" represents the character sequence "10".
These token IDs will then be "embedded", which means mapping them to points in a very high dimensional space (e.g. 4096-D for LLaMA 7B), so each of those token IDs becomes a vector of 4096 real-valued numbers, and these vectors are what the model itself actually sees as input.
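As a rough sketch of that embedding step (PyTorch-style; the vocabulary size is truncated for the example and the 4096-D width is just the LLaMA-7B figure quoted above, not GPT-4's actual numbers):

```python
# Token IDs are looked up in a learned embedding table of real-valued vectors.
import torch
import torch.nn as nn

vocab_size, d_model = 5000, 4096
embed = nn.Embedding(vocab_size, d_model)           # learned lookup table
token_ids = torch.tensor([4513, 1774, 353, 220, 605])
vectors = embed(token_ids)                           # the vectors the model consumes
print(vectors.shape, vectors.dtype)                  # torch.Size([5, 4096]) torch.float32
```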
So, for "12345 * 10", what the model sees during training is that whenever it sees V1 V2 V3 V4 it should predict V5, where V1-5 are those 4096-D input token embeddings. The model has no idea what any of these mean - they might represent "the cat sat on the mat" for all it knows. They are just a bunch of token representations, and the LLM is just trying to find patterns in the examples it is given to figure out what the preferred "next token" output is.
So, could you build (and train) a neural net to multiply, or add, two numbers together? Yes you could, if that is all you want to do. Is that what an LLM is? No, an LLM is a sequence predictor, not an NN designed and trained to do arithmetic, and all that is inside an LLM is a transformer (sequence-to-sequence predictor).
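To make that first point concrete, here is a toy sketch of a network trained purely to add two numbers; a single linear layer can represent addition exactly (weights [1, 1], bias 0), so this converges quickly. All sizes and hyperparameters are arbitrary choices for illustration:

```python
# Toy regression: train a linear layer to add two numbers.
import torch
import torch.nn as nn

model = nn.Linear(2, 1)
opt = torch.optim.Adam(model.parameters(), lr=1e-2)

for step in range(2000):
    x = torch.rand(256, 2) * 10               # random pairs in [0, 10)
    y = x.sum(dim=1, keepdim=True)             # target: their sum
    loss = nn.functional.mse_loss(model(x), y)
    opt.zero_grad()
    loss.backward()
    opt.step()

print(model(torch.tensor([[3.0, 4.0]])).item())  # close to 7.0
```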
I know why it is hard for LLMs to learn this, that was the whole point. The way we make LLMs today means they can't identify such structures, and that is strong evidence they won't become smart just by scaling, since all the things you brought up will still be true as we scale up.
To solve this you would need some sub-networks that are pretrained to handle numbers and math and other domains, and then when you start training the giant LLM it can find and connect to those things. But we don't know how to do that well yet, AFAIK, and I bet all the big players have already tested things like that. As you say, adding capabilities to the same model is hard.
An LLM can learn to identify math easily enough; it's just that performing calculations purely through language isn't very efficient, even if it's basically what we do ourselves. If you want an LLM to do it like us, then give it a pencil and paper ("think step by step").
If you want the LLM to be better than a human at math, then give it a calculator, or access to something like Wolfram Alpha for harder problems. Your proposed solution of "give it a specialized NN for math" is basically the same, but if you are going to give it a tool, then why not give it a more powerful one like a calculator?!
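A rough sketch of the "give it a calculator" idea: the model requests a tool call, the host evaluates the expression exactly, and the result goes back to the model. The `ask_llm` / `ask_llm_with_tool_result` calls are hypothetical placeholders for whatever chat API is used; only the calculator side is concrete here:

```python
# Minimal calculator tool a host could expose to a model.
import ast
import operator as op

OPS = {ast.Add: op.add, ast.Sub: op.sub, ast.Mult: op.mul, ast.Div: op.truediv}

def calculator(expr: str) -> float:
    """Evaluate a basic arithmetic expression without using eval()."""
    def ev(node):
        if isinstance(node, ast.Constant):
            return node.value
        if isinstance(node, ast.BinOp):
            return OPS[type(node.op)](ev(node.left), ev(node.right))
        raise ValueError("unsupported expression")
    return ev(ast.parse(expr, mode="eval").body)

# Hypothetical flow (placeholders, not a real API):
#   tool_call = ask_llm("What is 12345 * 10?", tools=["calculator"])
#   result = calculator(tool_call.arguments["expr"])     # host runs it exactly
#   answer = ask_llm_with_tool_result(result)            # model phrases the reply
print(calculator("12345 * 10"))  # 123450
```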
Humans were terrible at getting calculations right - that's why we invented abacuses, slide-rules, books of mathematical tables and tabulation machines.
Humans invented those since we are slow and have limited working memory. But we managed to invent those since we understand how to perform reliable calculations.
Yes, but that acknowledges that there is a difference between understanding how to perform reliable calculations, and actually being able to perform reliable calculations.
Humans are good at the former, but not the latter.
Humans are good at performing reliable calculations with pen and paper. That is the same kind of tool that LLMs work with. I'm not sure why humans can do that but not LLMs; the task should be way easier for an LLM.
> Humans are good at performing reliable calculations with pen and paper.
Speak for yourself. Even though I've always been strong at my conceptual understanding and problem solving in math, I always found it difficult to avoid arithmetic mistakes on pen and paper and could never understand why I was assessed on that. I could have done so much better in high-school math if I was allowed to use a programmable computer for the calculations.
And I think it's the same for LLMs: we shouldn't assess them on doing the arithmetic in a single pass, but rather on writing the code to perform the calculation, and responding based on that.
Maybe a lot of people suffer from a degree of dyscalculia, but in my experience if you do it a lot you just stop making mistakes. Not just me; I've seen many others reliably do calculations pretty quickly without making errors. You just do everything twice as you go, and then arithmetic errors drop to basically zero.
But I do acknowledge that there are probably some or many humans that maybe can't reach that level of reliability with arithmetic.
LLMs (internally) don't have a pen and paper equivalent. Their output is the output of their neurons. Like if I were a head on a table with a screen on my forehead that printed out my thoughts as they appeared in my head. Ask (prompt) me my favorite color and "green" would show up on the screen.
This is why prompting LLMs to show their steps works so well: it makes them work through the problem "in their head" more efficiently, rather than just spitting out an answer.
However, you can give LLMs external access to tools. Ask GPT-4 a particularly challenging math problem, and it will write a Python script and run it to get a solution. That is an LLM's "pen and paper".
No, that is an LLM's calculator or programming environment; it doesn't actually do the steps when it does that. When I use pen and paper to solve a problem I do all the steps on my own; when I use a calculator or a programming language, the tool does a lot of the work.
That difference is massive: using a calculator doesn't help me learn numbers, how they interact, and how algorithms work, while doing the steps myself does. So getting an LLM that can reliably execute algorithms like we humans can is probably a critical step towards making them as reliable and smart as humans.
I do agree, though, that if LLMs could keep a hidden voice they used to reason before writing, they could do better; but that voice being shown to the end user shouldn't make the model dumber, you would just see more spam.
You are splitting hairs on technicalities here. You need to do a lot of "steps" to write a program that solves your question. Debatably even more steps and more complexity than using pen and paper.
Maybe we should be giving the LLMs MS Paint instead of Python to work out problems? There is nothing unique or "human" about running through a long division problem, it is ultimately just an algorithm that is followed to arrive at a solution.
> There is nothing unique or "human" about running through a long division problem, it is ultimately just an algorithm that is followed to arrive at a solution.
Yes, which is why we should try to make LLMs do them, and that way open them up to learning a much more complex understanding of algorithms and instructions that humans have yet to build a tool for.
> You need to do a lot of "steps" to write a program that solves your question. Debatably even more steps and more complexity than using pen and paper.
What does this have to do with anything? I am highlighting a core deficiency in how LLMs are able to reason, you saying that what they currently do is harder doesn't change the fact that they are bad at this sort of reasoning.
And no, making such a program doesn't require more steps or understanding. You Google for a solution and then paste in your values; that is much easier to teach a kid than to teach them math. I am sure I can teach almost any 7-year-old kid to add two numbers by changing values in a Python program in about an hour, much faster than they could learn math the normal way. Working with such templates is the easiest task for an LLM; what we want is to try to get the LLM to do things that are harder for it.
"I have a problem for you to solve. Muffins sell for $3/each. rick bakes 30 muffins a day. Tom bakes 2 muffins monday, 4 tuesday, 6 wednsdays, up to 14 on sunday. On days which tom and jerry combined bake more than 41 muffins, the price of the muffins drops to $2.50. How much total revenue do rick and tom take in during a full week, combined."
Please tell me how ChatGPT-4 writing a script to solve that is not logical reasoning, while a human pulling out pen and paper to do it is...
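For reference, this is roughly the kind of script one might expect GPT-4 to write for the original numbers, under two assumptions that are only implicit in the prompt: Tom's even-number pattern (2, 4, 6, ...) runs Monday through Sunday, and "jerry" is meant to be Rick:

```python
# Weekly revenue for Rick and Tom's muffins, per the prompt's rules.
rick = [30] * 7                        # Rick bakes 30 muffins every day
tom = [2, 4, 6, 8, 10, 12, 14]         # Tom, Monday through Sunday (assumed pattern)

revenue = 0.0
for r, t in zip(rick, tom):
    combined = r + t
    price = 2.50 if combined > 41 else 3.00   # price drops on high-volume days
    revenue += combined * price

print(revenue)  # 755.0 under these assumptions
```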
> Please tell me how ChatGPT-4 writing a script to solve that is not logical reasoning, while a human pulling out pen and paper to do it is...
I changed the prompt a bit (made all the numbers 3-4 digits) and GPT-4 answered with this. It just made up numbers for the days you didn't give numbers for, so it failed before it even got to the arithmetic. Here is what it said after I said this about Tom: "Tom bakes 2911 muffins monday, 491 tuesday, 699 wednsdays, up to 149 on sunday." It just assumed Sunday's number applied to all the other weekdays not given (a human wouldn't do that), and it missed the "up to" statement. Maybe the large numbers I gave threw it off, but if that is enough to throw it off, that just shows it can't really reason.
So thanks for that, more evidence these models are bad at reasoning.
Here is the first part of what it responded with; it is already wrong at this point:
First, let's calculate the number of muffins baked by Tom during the week:
Monday: 2911
Tuesday: 491
Wednesday: 699
Thursday: 149
Friday: 149
Saturday: 149
Sunday: 149
Edit: Here it made an arithmetic error just below; the error is that 4062 is not greater than 4199. So that's two critical errors. I taught math at college for years and you wouldn't find many students making mistakes like this:
Let's determine the days when Tom and Rick combined bake more than 4199 muffins:
Monday: 2911 (Tom) + 3571 (Rick) = 6482
Tuesday: 491 (Tom) + 3571 (Rick) = 4062
Wednesday: 699 (Tom) + 3571 (Rick) = 4270
On Monday, Tuesday, and Wednesday, they bake more than 4199 muffins combined, so the price of the muffins drops to $2851.50 on those days.
Just so we have this straight, you completely changed the nature of the problem (by turning a perfect information problem into an imperfect information problem) and then are looking at me with a straight face to make your point? Please...
Unless of course you didn't realize that tom has a pattern to his baking, at which point the irony becomes palpable.
And on top of that, I am willing to bet if you give me your prompt, I would be able to restructure it in such a way that GPT4 would be able to answer it correctly. More often than not, people are just really bad at properly asking it questions.
> Just so we have this straight, you completely changed the nature of the problem (by turning a perfect information problem into an imperfect information problem) and then are looking at me with a straight face to make your point? Please...
I used your exact quote and just changed the numbers, it is still a perfect information problem.
Or, ah, right, you mean you gave me an imperfect information problem since you assumed the reader would guess those values. Yeah, I read it as a perfect information problem where all values were given, and then you would give the income as a range of possible values based on how many muffins were baked on Sunday. None of the LLMs I sent it to managed to solve it entirely; it is a pretty easy problem.
A reasonable way to parse your sentence is:
Monday: 2, Tuesday: 4, Wednesday: 6, Sunday: 0-14, rest: doesn't work so 0
> Unless of course you didn't realize that tom has a pattern to his baking, at which point the irony becomes palpable.
If you didn't say he baked on those days, then he didn't bake on those days. The specification is clear. If I say "I will bake 2 muffins on Tuesday and 6 muffins on Sunday", the reasonable interpretation is that I won't bake anything the rest of the days. Why would you assume he baked anything at all on those days?
Or if I say "Emily will work Mondays and Thursdays", do you just guess the rest of the days she will work? No, you assume she just works those days.
Is that a standard problem you wrote from memory? Not sure why you would assume there were muffins baked in the days you didn't list.
For example, if I say Tom bakes up to 14 muffins on Sunday, then the reasonable interpretation is that Tom will bake 0-14 muffins on Sunday. Maybe you should write the prompt more clearly if you mean something else? Because as written, anyone would assume that he didn't bake on the other days, and on Sundays he baked up to 14 muffins.
Anyway, it failed even with your "up to" interpretation, where the reader should fill in the values; it still made that math error. But it using your "up to" interpretation there is a huge red flag, since in a real environment nobody would give that kind of information as a riddle with hidden values: you would specify the values for each day each person worked, and for the rest you would assume the person just isn't working and baked 0 muffins. If the LLM starts guessing values for patterns and words where it doesn't make sense, then it is really unreliable.
I know they can do that, but not as reliably as, for example, I can, or as typical engineers from 80 years ago could. I did engineering exams without a calculator: I just did all the calculations with pen and paper and didn't make mistakes. It just takes a bit longer, since calculating trigonometric functions takes a while, but still not a lot of time compared to how much you have.
That was how everyone did it back then; it really isn't that hard to do. Most people today have never tried it, so they think it is much harder than it actually is.
> strong evidence that LLMs have a very limited capability to learn logical structures that can't be represented as grammar.
Adding multi-digit numbers requires short-term memory (are we on the units, or tens? was there a carry?), which LLMs don't have, so that's really the issue.
The normal workaround for lack of memory in LLMs is "think step-by-step", using its own output (which gets fed back in as an input) as memory, and I'd assume that with appropriate training data and prompting an LLM could learn to do it in this fashion - not just giving the answer, but giving all the steps.
I suppose in theory LLMs could do limited-precision math even without memory, if they did it in a single pass through their stack of transformer layers (96 for GPT-3) - use the first few layers to add units and generate a carry, the next few layers to add tens, etc. I'm not sure how, or even if, one could train them to do it this way though - perhaps via a curriculum training agenda of first adding single-digit numbers, then two-digit ones, etc.?
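For illustration, here's a sketch of the "give all the steps" format mentioned above: a carry-by-carry trace of grade-school addition, the kind of intermediate text an LLM would need to emit (or be trained on) instead of a bare answer:

```python
# Generate a step-by-step addition trace with explicit carries.
def addition_trace(a: int, b: int) -> str:
    xs, ys = str(a)[::-1], str(b)[::-1]      # work from the units digit upward
    carry, digits, lines = 0, [], []
    for i in range(max(len(xs), len(ys))):
        da = int(xs[i]) if i < len(xs) else 0
        db = int(ys[i]) if i < len(ys) else 0
        total = da + db + carry
        lines.append(f"place {10**i}: {da} + {db} + carry {carry} = {total}, "
                     f"write {total % 10}, carry {total // 10}")
        digits.append(str(total % 10))
        carry = total // 10
    if carry:
        digits.append(str(carry))
    lines.append("answer: " + "".join(reversed(digits)))
    return "\n".join(lines)

print(addition_trace(457, 685))  # three carries, answer 1142
```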
> Arithmetic is extremely easy for a neural network to perform and learn perfectly. That LLMs fail to learn it, even though it is so easy, is strong evidence that LLMs have a very limited capability to learn logical structures that can't be represented as grammar.
IDK, there was an article posted here about yet another LLM that performed very badly on the math tests because they mistakenly left out all the math training data.
What impressed me was that it could learn any math at all from just 'reading' books or whatever. Though, perhaps, any correct answer could be attributed to pure luck, dunno.
Yes, loss minimization quickly gets to the correct implementation of arithmetic, since the primitives of neural networks are just math operations, so training one to add or multiply two inputs into an output is very easy. This is so easy and obvious that you run it to test that your neural network implementation works; if it can't figure out arithmetic, then you have done something wrong.
LLMs fail to figure out that this is what they have to do; instead it looks like they have a ton of specialized rules to handle arithmetic that result in a lot of errors in the output and are extremely expensive to run.
So the networks you mentioned aren’t LLMs? Why is that a correct comparison then? Like blaming a human because they can’t jump like a cat or multiply like an arbitrary-precision library.
> So the networks you mentioned aren’t LLMs? Why is that a correct comparison then
Because an LLM is a neural network, and neural networks contain neural networks. There is nothing stopping it from having an embedded neural network that learned how to do computations well, except an inability to identify such structures and patterns well enough to train for it.
> It just makes me question humans' intelligence if anything.
A more serious byproduct of the tendency to talk about machines in anthropomorphic terms is the companion phenomenon of talking about people in mechanistic terminology. The critical reading of articles about computer-assisted learning —excuse me: CAL for the intimi— leaves you no option: in the eyes of their authors, the educational process is simply reduced to a caricature, something like the building up of conditional reflexes. For those educationists, Pavlov’s dog adequately captures the essence of Mankind —while I can assure you, from intimate observations, that it only captures a minute fraction of what is involved in being a dog.
I mean I could equally say that the opposing bias is
1. Choose a few good blunt instruments we use to gatekeep students on the premise that it tests their "intelligence" (or wait, do we mean subject matter comprehension with this one?)
2. Apply a big ol' machine learning model to those tests
3. Woa it's smarter than a third grader! OMG it's smarter than a lawyer! You guys this must be ASI already!
Rhetoric and selective rigor can justify any perspective. Smart and stupid arguments can be made for any position. Water is wet
I also can't claim to know with certainty whether transformers are going to end up being AGI in some meaningful sense, but I will definitely say that we've created a lot of rubrics for assessing human intelligence that mostly exist for expediency, and a cursory glance at education should tell you there's a lot of Goodhart's Law going on with all of 'em. I know for a fact I can do a damn good job on your average multiple choice test on knowing some etymology and being good at logical elimination, and I can bullshit my way through an essay, both without taking the class, and I view this more as a flaw in the instrument than evidence that I'm a godlike superintelligence that can just know anything without studying it. Humans make a lot of tests that are soft to bullshitting with a little pattern-recognition thrown in
LLMs aren't wrong by a small percentage, they are wrong by a small number of tokens. They can miss a zero or be off by 100% and it's just a token difference; to the LLM that is a minor mistake, since everything else was right, but it is a massive mistake in practice.
I watch math classes on YouTube and some lecturers make symbolic mistakes all the time. A minus instead of a plus, missing exponents, saying x but writing y, etc. They only notice it when something unexpected contradicts it down the line.
They got it right, as you said; it just took a bit longer. That doesn't contradict what I said: humans can get things right very reliably by looking over the answers, especially if you have another human to help check them. An AI isn't comparable to a human, it is comparable to a team of humans, and two ChatGPTs can't get more accurate by correcting each other's answers, but two humans can.
A professor might be able to iterate to a correct answer but a student might not.
And ChatGPT is definitely able to improve its answer by iterating; it just depends on the toughness of the problem. If it's too difficult, no amount of iteration will get it much closer to the correct answer. If it's closer to its reasoning limits, then iterating will help.
But if you stop them just there, an error persists. A professor is “multi-modal” and in a constant stream of events, including their lecture plan and premeditated key results. Are you sure that at some level of LLM “intelligence”, putting it into the same shoes wouldn’t improve the whole setting enough? I mean sure, they make mistakes. But if you stop-frame a professor, they make mistakes too. They don’t correct immediately, only after a contradiction gets presented. Reminds me of how LLMs behave. Am I wrong here?
Edit: was answering to gp, no idea how my post got here
Asking the LLM to correct itself doesn't improve answers, since it will happily add errors to correct answers when asked to correct them. That makes it different from humans: humans can iterate and get better; our current LLMs can't.
> But if you stop them just there, an error persists
But humans don't stop there when they are making things that need to be reliably correct. When errors aren't a big deal humans make a lot of errors, but when errors cost lives humans become very reliable by taking more time and looking things over. They still sometimes make mistakes that kill people, but very rarely.
So many things contribute to human error that it is probably impossible to make a 1-to-1 parallel with LLMs. For instance, the fact that you are being recorded in many cases causes a significant performance drop.
What uncertainty and threshold is there in the addition of integers, for example (within mathematics and the usual definitions)? Or in Boolean logic with the "and" operation?
I don't think everything has uncertainty and thresholds to it, especially when it actually resides outside of a technical implementation.
To verify the answer you'll always need to trust the technical implementation that's doing the computation. Doesn't matter if it's our brains or a calculator.
Somewhere between "it's always wrong" and "it's always right unless the bits got flipped by cosmic rays" we deem the accuracy to be good enough.
Disagree, the theory exists outside of any specific technical implementation (every single one of those could be wrong, for example). You might not be able to verify something without being subject to random errors, but that doesn't mean the theory itself is subject to random errors.
Any implementation (or write down etc.) of something can have errors, but the errors are in the implementation and do not give rise to uncertainty outside of the implementation. There is no uncertainty as to what the sum of two integers should be (within the usual mathematics).
I totally see where you’re coming from, and yes, the first few days on it were nightmarish, but I eventually got good at it. I code as fast as I do on my 65%, and never having to move your hands away from the main position is actually super comfortable! Chords and tap dancing definitely require a lot of muscle memory, but they are game changers.
They are great as they have CUDA cores etc.; the only downside is that you have to use their Ubuntu image to utilize all the features, as the drivers are proprietary. You can install other distributions, but you will lose that functionality.
If you are looking for an almost-as-powerful SBC (with M.2 support, as I see you have it in the original design), take a look at the Rock5-B (1). I haven’t tested it personally, unlike the Nvidia ones, but it seems to have great specs, and you might also struggle to secure Jetson boards sometimes due to stock being sold out.
I remember I tried Yocto; it is indeed customizable, but when I tested it, its performance wasn’t on par with the stock image, specifically with GStreamer and VP9 encoding/decoding. Maybe it has changed now though; worth a try, OP.
Heh - I can't resist throwing a mention in for my favorite here too (1). Available, cheap, x86-based... I love them :-) I have 3-4 of the H2 models, and they have only gotten better!
I built a cyberdeck bigger than this and the case came out to be around 30 euros.