Back in the day, Google eng had pretty unguarded access to people's Gmail, calendars, etc. Then there was a news story about a Google SRE who was grooming children and stalking them through their Google accounts...
It was the classic "oh no we did caching wrong" bug that many startups bump into. It didn't expose actual conversations though, only their titles: https://openai.com/index/march-20-chatgpt-outage/
The extra system prompt can definitely cause some performance issues, and it can overuse them. The deleting-every-line behavior is gone, though. It's definitely not something you should turn on for every conversation, but it's quite compelling for creating little capsule web apps.
Sort of a hardware advancement. I'd say it's more of a sidegrade between different types of well-established processors. Take out a couple of cores, put in some extra-wide matrix units with accumulators, and watch the neural nets fly.
But I want to point out that going from CPU to TPU is basically the opposite of a Moore's law improvement.
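To make the "matrix units with accumulators" idea concrete, here's a toy NumPy sketch of one multiply-accumulate tile. The int8 inputs, int32 accumulator, and 128x128 tile size are illustrative guesses at what such a unit does, not a spec of any real chip:

    # Toy sketch: multiply low-precision inputs and sum the products
    # into a wider accumulator so nothing overflows along the way.
    import numpy as np

    def mac_tile(a_int8: np.ndarray, b_int8: np.ndarray) -> np.ndarray:
        """Multiply-accumulate one tile: int8 x int8 -> int32 accumulator."""
        acc = np.zeros((a_int8.shape[0], b_int8.shape[1]), dtype=np.int32)
        for k in range(a_int8.shape[1]):
            # Each step adds one outer product into the accumulator,
            # roughly what a systolic array does one "beat" at a time.
            acc += np.outer(a_int8[:, k].astype(np.int32),
                            b_int8[k, :].astype(np.int32))
        return acc

    a = np.random.randint(-128, 128, size=(128, 128), dtype=np.int8)
    b = np.random.randint(-128, 128, size=(128, 128), dtype=np.int8)
    print(mac_tile(a, b).dtype)  # int32 -- the wide accumulator

The point of the wide accumulator is that thousands of small products can be summed without intermediate rounding or overflow, which is what lets the matrix unit stay narrow on the multiply side.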
> And you should be able to get two and load half your model into each. It should be about the same speed as if a single card had 32GB.
This seems super duper expensive and not really supported by the more reasonably priced Nvidia cards, though. SLI is deprecated, NVLink isn't available everywhere, etc.
Every layer of an LLM runs separately and sequentially, and there isn't much data transfer between layers: only the activations have to move, the weights stay put. If you wanted to, you could put each layer on a separate GPU with no real penalty. A single request will only run on one GPU at a time, so it won't go faster than a single GPU with a big RAM upgrade, but it won't go slower either.
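For illustration, here's a rough PyTorch sketch of that layer-by-layer split, assuming two CUDA devices. The TransformerBlock and SplitModel names are placeholders, not anything from a real model loader:

    import torch
    import torch.nn as nn

    class TransformerBlock(nn.Module):      # stand-in for a real LLM layer
        def __init__(self, d_model=512):
            super().__init__()
            self.ff = nn.Sequential(nn.Linear(d_model, 4 * d_model),
                                    nn.GELU(),
                                    nn.Linear(4 * d_model, d_model))

        def forward(self, x):
            return x + self.ff(x)

    class SplitModel(nn.Module):
        def __init__(self, n_layers=8, d_model=512):
            super().__init__()
            layers = [TransformerBlock(d_model) for _ in range(n_layers)]
            # First half of the layers lives on GPU 0, second half on GPU 1.
            self.first = nn.Sequential(*layers[: n_layers // 2]).to("cuda:0")
            self.second = nn.Sequential(*layers[n_layers // 2:]).to("cuda:1")

        def forward(self, x):
            x = self.first(x.to("cuda:0"))
            # Only the hidden state crosses between cards here -- a small
            # transfer compared with the weights, which never move.
            x = self.second(x.to("cuda:1"))
            return x

    model = SplitModel()
    tokens = torch.randn(1, 128, 512)       # (batch, seq, d_model)
    out = model(tokens)                      # runs on GPU 0, then GPU 1

The request hops from card to card, so at any moment one GPU is working while the other waits, which is why throughput matches a single bigger-memory card rather than exceeding it.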