
I just used GPT-OSS-120B on a cross Atlantic flight on my MacBook Pro (M4, 128GB RAM).

A few things I noticed:

- It's only fast with small context windows and small total token counts; past ~10k tokens you're basically queueing everything for a long time.

- MCPs/web search/URL fetch have already become a very important part of interacting with LLMs; when they're not available, the LLM's utility is greatly diminished.

- A lot of CLI/TUI coding tools (e.g., opencode) were not working reliably offline with the model at this time, despite being set up prior to going offline.

That’s in addition to the other quirks others have noted with the OSS models.



I know there was a downloadable version of Wikipedia (not that large). Maybe soon we'll have a lot of data stored locally and expose it via MCP, then the AIs can do "web search" locally.

I think 99% of web searches lead to the same 100-1k websites. I assume it's only a few GBs to have a copy of those locally, though this raises copyright concerns.


The mostly static knowledge content from sites like Wikipedia is already well represented in LLMs.

LLMs call out to external websites when something isn’t commonly represented in training data, like specific project documentation or news events.


That's true, but the data is only approximately represented in the weights.

Maybe it's better to have the AI only "reason", and somehow instantly access precise data.


Is this Retrieval Augmented Generation, or something different?


Yes, RAG, but with the model specifically optimized for RAG.
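The split being described, where the model "reasons" and a retriever supplies precise local data, can be sketched with a toy retriever. Keyword overlap stands in for real embedding similarity here, and the function names and documents are purely illustrative:

```python
# Toy retrieval step for RAG: score local documents against a query
# by keyword overlap (a stand-in for real embedding similarity).

def retrieve(query: str, docs: list[str], k: int = 1) -> list[str]:
    q_terms = set(query.lower().split())
    scored = sorted(
        docs,
        key=lambda d: len(q_terms & set(d.lower().split())),
        reverse=True,
    )
    return scored[:k]

docs = [
    "The A320 carries three 90 kVA generators.",
    "llama.cpp is a C++ inference runtime for GGUF models.",
    "Wikipedia offers downloadable database dumps.",
]

# The retrieved text would be prepended to the prompt so the model
# can quote precise data instead of relying on its weights.
print(retrieve("which runtime loads GGUF models", docs))
```

A RAG-optimized model would then be trained to prefer quoting that retrieved context over its own parametric memory.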


What use cases will gain from this architecture?


Data processing, tool calling, agentic use. Those are also the main use-cases outside "chatting".


Are you using Ollama or LMStudio/llama.cpp? https://x.com/ggerganov/status/1953088008816619637


> LMStudio/llama.cpp

Even though LM Studio uses llama.cpp as a runtime, the performance differs between them. Using LM Studio 0.3.22 Build 2 with the CUDA llama.cpp (Linux) v1.45.0 runtime I get ~86 tok/s on an RTX Pro 6000, while with llama.cpp compiled from 1d72c841888 (Aug 7 10:53:21 2025) I get ~180 tok/s, almost 100 more tokens per second, both running lmstudio-community/gpt-oss-120b-GGUF.


Is it always like this or does it depend on the model?


Depends on the model. Each runner needs to implement support when new architectures come out, and they all seemingly focus on different things. As far as I've gathered, vLLM focuses on inference speed, SGLang on parallelizing across multiple GPUs, Ollama on getting their implementation out the door as fast as possible, sometimes cutting corners, and llama.cpp sits somewhere in between Ollama and vLLM. LM Studio then seems to lag slightly behind with its llama.cpp usage, so I'm guessing that's the difference between LM Studio and building llama.cpp from source today.


What was your iogpu.wired_limit_mb set to? By default only ~70% or ~90GB of your RAM will be available to your GPU cores unless you change your wired limit setting.
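On macOS the limit can be raised with sysctl. The 96 GB figure below is just an illustration for a 128 GB machine (leave headroom for the OS), and the setting resets on reboot:

```shell
# Check the current GPU wired-memory limit (0 means the ~70% default applies).
sysctl iogpu.wired_limit_mb

# Allow the GPU to wire up to ~96 GB of the 128 GB of unified memory.
# The value is in MB: 96 * 1024 = 98304.
sudo sysctl iogpu.wired_limit_mb=98304
```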


M2 Max processor. I saw 60+ tok/s on short conversations, but it degraded to 30 tok/s as the conversation got longer. Do you know what actually accounts for this slowdown? I don’t believe it was thermal throttling.


Physics: You always have the same memory bandwidth. The longer the context, the more bits will need to pass through the same pipe. Context is cumulative.


No, I don't think it's the bits. I would say it's the computation. Inference requires performing a lot of matmuls, and with more tokens the number of operations increases exponentially - O(n^2) at least. So increasing your context/conversation will quickly degrade performance.

I seriously doubt it's the throughput of memory during inference that's the bottleneck here.


Nitpick: O(n^2) is quadratic, not exponential. For it to “increase exponentially”, n would need to be in the exponent, such as O(2^n).
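The gap between the two is easy to see numerically; a quick comparison of n^2 against 2^n:

```python
# Quadratic growth (n**2) vs. exponential growth (2**n).
# The two diverge almost immediately.
for n in [10, 20, 40]:
    print(f"n={n:>2}  n^2={n**2:>5}  2^n={2**n:>15}")
```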


To contrast with exponential, the term is power law.


Typically, the token generation phase is memory-bound for LLM inference in general, and this becomes especially clear as context length increases (since the model's parameters are a fixed quantity.) If it was pure compute bound there would be huge gains to be had by shifting some of the load to the NPU (ANE) but AIUI it's just not so.


It literally is. LLM inference is almost entirely memory bound. In fact for naive inference (no batching), you can calculate the token throughput just based on the model size, context size and memory bandwidth.


Prompt pre-processing (before the first token is output) is raw compute-bound. That's why it would be nice if we could direct llama.cpp/ollama to run that phase only on iGPU/NPU (for systems without a separate dGPU, obviously) and shift the whole thing over to CPU inference for the latter token-generation phase.

(A memory-bound workload like token gen wouldn't usually run into the CPU's thermal or power limits, so there would be little or no gain from offloading work to the iGPU/NPU in that phase.)


Inference takes a quadratic amount of time with respect to context size.
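That quadratic term comes from attention: prefilling n tokens scores every token against every earlier one. The sketch below counts score entries only, ignoring the per-token feed-forward work that is linear in n:

```python
# Number of attention score entries for a causal prefill of n tokens:
# token i attends to tokens 0..i, so the total is n*(n+1)/2, i.e. O(n^2).

def attention_scores(n: int) -> int:
    return n * (n + 1) // 2

for n in [1_000, 10_000, 100_000]:
    print(f"{n:>7} tokens -> {attention_scores(n):>15,} score entries")
```

Going from 10k to 100k tokens is 10x the context but ~100x the scoring work.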


M3 Max 128GB here and it’s mad impressive.

I'm spec'ing out a Mac Studio with 512GB RAM because I can window-shop and wish, but I think the trend for local LLMs is getting really good.

Do we know WHY openAI even released them?


> Do we know WHY openAI even released them?

Regulations, and trying to earn the goodwill of developers using local LLMs, something that has been slowly eroding since they last released weights to the public a while ago (GPT-2, 2019).


If the new GPT-5 is actually better, then this OSS version is not really a threat to OpenAI's income stream, but it can be a threat to their competitors.


> Do we know WHY openAI even released them?

Enterprises can now deploy them on AWS and GCP.


I think this is the difference between compute-bound prefill (a CPU has a high bandwidth/compute ratio) and decode. The time to first token is below 0.5s, even for a 10k context.


You didn’t even mention how it’ll be on fire unless you use low power mode.

Yes all this has been known since the M4 came out. The memory bandwidth is too low.

Try using it on real tasks with tools like Cline or opencode: the context gets too long and generation is too slow to be practical.


> Yes all this has been known since the M4 came out. The memory bandwidth is too low.

The M4 Max with 128GB of RAM (the part used in the comment) has over 500GB/sec of memory bandwidth.


Which is incredibly slow when you’re over 20k context


How long did your battery last?!


Planes have power sockets now, but I do wonder how much jet fuel a whole plane of GPUs would consume in electricity (assuming the system could handle it, which seems unlikely) and air conditioning.


That's an interesting question. According to Rich and Greg's Airplane Page[1], the A320 has three generators rated for 90 kVA continuous each: one per engine and a third in the auxiliary power unit that isn't normally deployed. Cruising demand is around 140 kVA of the 180 kVA supplied by the engines, leaving 40 kVA to spare. The A380 has six similar generators, two in reserve. They give the percentages, so you could calculate how much fuel each system is consuming.

[1] https://alverstokeaviation.blogspot.com/2016/03/

This page also has a rendered image of the generator:

https://aviation.stackexchange.com/questions/43490/how-much-...
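Taking the 40 kVA spare-capacity figure above at face value, a rough estimate of how many passenger laptops it could power. The power factor and per-laptop wattage are assumptions, not from the source:

```python
# How many ~100 W laptops could an A320's spare generator capacity run?
spare_kva = 40        # spare capacity from the generator figures above
power_factor = 0.8    # assumed kW of real power per kVA
laptop_w = 100        # assumed draw per MacBook running a local LLM

spare_w = spare_kva * 1000 * power_factor
print(f"~{int(spare_w // laptop_w)} laptops")
```

So by this crude estimate, a narrow-body full of inference workloads would be within the electrical budget, though the cabin cooling is another matter.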




