> Honestly the absolute revolution for me would be if someone managed to make LL...

snovv_crash · 2026-02-02T20:43:33 1770065013

That is such a cop-out, if there was a really good benchmark for getting rid of hallucinations then it would be included in every eval comparison graph.

The real reason is that every bench I've seen has Anthropic with lower hallucinations.

KellyCriterion · 2026-02-02T17:42:16 1770054136

...or they hallicunate because of floating point issues in parallel execution environments:

https://thinkingmachines.ai/blog/defeating-nondeterminism-in...

cbdevidal · 2026-02-02T12:15:18 1770034518

Holy perverse incentives, Batman