Yes, that's what I had in mind. Interesting that they can get decent convergence even with 100x, considering that to get a decently accurate gradient in all D dimensions you'd need D forward passes, we're talking about D in the billions, and each forward pass itself costs billions of FLOPs. And even then it's only an estimate since, as you say, the delta is not infinitesimal.

Interesting that it converges when sampling just one point in one direction per iteration.
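
For concreteness, a minimal sketch (toy code of mine, not the paper's) contrasting the two estimators: coordinate-wise finite differences, which costs ~2D forward passes for one gradient, versus the single-random-direction (SPSA-style) estimate, which costs two forward passes per step no matter how big D is. The function names and the diagonal quadratic are made up for illustration:

    import numpy as np

    def full_fd_gradient(f, theta, eps=1e-4):
        # Coordinate-wise finite differences: one probe per dimension,
        # so 2*D forward passes for a D-dimensional theta.
        grad = np.zeros_like(theta)
        for i in range(theta.size):
            e = np.zeros_like(theta)
            e[i] = eps
            grad[i] = (f(theta + e) - f(theta - e)) / (2 * eps)
        return grad

    def spsa_gradient(f, theta, eps=1e-4, rng=None):
        # One random direction per iteration: 2 forward passes, any D.
        # Unbiased as eps -> 0, but a single sample is very noisy.
        rng = rng or np.random.default_rng()
        z = rng.standard_normal(theta.shape)
        g = (f(theta + eps * z) - f(theta - eps * z)) / (2 * eps)
        return g * z

    # Toy usage: diagonal quadratic loss in D = 100 dimensions.
    D = 100
    a = np.linspace(0.1, 1.0, D)              # Hessian eigenvalues
    loss = lambda t: 0.5 * np.sum(a * t * t)

    theta = np.ones(D)
    for _ in range(5000):
        theta -= 0.01 * spsa_gradient(loss, theta)
    print(loss(theta))    # orders of magnitude below the initial ~27.5

That gap is exactly the puzzle: one probe gives a very noisy estimate, yet averaging across SGD-style iterations still makes progress.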



Right, that's certainly surprising and intriguing. They discuss this in Section 4 and offer a theoretical argument for why the rate of convergence might be independent of the (large) number of parameters. I haven't fully grokked it yet, but maybe one could think of it as a consequence of the shape of the landscape (the cost function) in the overparametrized regime.
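
A toy check of that intuition (my construction, not the paper's Section 4 analysis): if the Hessian has low effective rank k << D, the gradient vanishes along the D - k flat directions, so probe noise there costs nothing and the convergence speed tracks k rather than the ambient D:

    import numpy as np

    def run(D, k=10, steps=200, eps=1e-4, lr=0.05, seed=0):
        # Rank-k diagonal quadratic embedded in D dimensions: a stand-in
        # for a landscape that is flat in most directions.
        rng = np.random.default_rng(seed)
        a = np.zeros(D)
        a[:k] = 1.0                       # only k curved directions
        loss = lambda t: 0.5 * np.sum(a * t * t)
        theta = np.ones(D)
        for _ in range(steps):
            z = rng.standard_normal(D)
            g = (loss(theta + eps * z) - loss(theta - eps * z)) / (2 * eps)
            theta -= lr * g * z           # SPSA-style step
        return loss(theta)

    for D in (100, 1000, 10000):
        print(D, run(D))   # small in every case; no slowdown as D grows

Whether real loss landscapes have that kind of low effective rank locally is of course the substantive assumption in that story.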



