Another great library from the author of pybind11. I was blown away (as a non-CS person) that they approached GPU arrays by bypassing the CUDA C++ toolchain altogether: the library generates PTX on the fly and has the driver JIT-compile it. It might be a trivial concept for more experienced folks here, but it did expand my feeble mind. I've even been thinking about trying such an approach in Rust, which I'm learning at the moment.
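To make the "generate PTX on the fly" idea concrete, here is a toy sketch (in Python for brevity): build the kernel source as a string, specialized at runtime to the requested elementwise operation. Everything here — the template, the `emit_kernel` name, the register layout — is my own illustration, not the library's actual codegen. A real system would hand the resulting string to the CUDA driver (e.g. `cuModuleLoadData`) to JIT it; that step is omitted since it needs a GPU.

```python
# Toy "PTX on the fly": the kernel source is just a string we specialize
# at runtime. Bounds checking is omitted for brevity.
PTX_TEMPLATE = """\
.version 7.0
.target sm_52
.address_size 64

.visible .entry {name}(
    .param .u64 pa, .param .u64 pb, .param .u64 pc
)
{{
    .reg .b64 %rd<8>;
    .reg .b32 %r<5>;
    .reg .f32 %f<4>;

    ld.param.u64    %rd1, [pa];
    ld.param.u64    %rd2, [pb];
    ld.param.u64    %rd3, [pc];
    cvta.to.global.u64 %rd1, %rd1;
    cvta.to.global.u64 %rd2, %rd2;
    cvta.to.global.u64 %rd3, %rd3;

    // global thread index: blockIdx.x * blockDim.x + threadIdx.x
    mov.u32         %r1, %ntid.x;
    mov.u32         %r2, %ctaid.x;
    mov.u32         %r3, %tid.x;
    mad.lo.s32      %r4, %r2, %r1, %r3;
    mul.wide.s32    %rd4, %r4, 4;

    add.s64         %rd5, %rd1, %rd4;
    ld.global.f32   %f1, [%rd5];
    add.s64         %rd6, %rd2, %rd4;
    ld.global.f32   %f2, [%rd6];
    {op}.f32        %f3, %f1, %f2;     // the op is chosen at codegen time
    add.s64         %rd7, %rd3, %rd4;
    st.global.f32   %f3, [%rd7];
    ret;
}}
"""

def emit_kernel(op, name="vec_kernel"):
    """Emit PTX source for c[i] = a[i] <op> b[i], op in {'add','mul','sub'}."""
    assert op in ("add", "mul", "sub")
    return PTX_TEMPLATE.format(op=op, name=name)

print(emit_kernel("add"))
```

The appeal is that this sidesteps invoking nvcc entirely: the only dependency is the driver's built-in PTX JIT, and each traced expression can get its own specialized kernel.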
The author (Wenzel Jakob, also here on HN) is an absolute legend, and co-author of the famous second bible of physically based rendering: http://www.pbr-book.org/
He also wrote the Mitsuba renderer (for which this library was created), the Nori UI library, ... the guy is a machine, and extremely personable / friendly in person.
I hope to succeed in marrying his Field-Aligned Remeshing technique to surface-refinement mesh extraction from depth maps. It's _so_ good that I honestly believe it's worth the (significant) effort.
Don't hesitate to share your project if you ever do that (maybe here: http://www.arewelearningyet.com/gpu-computing/); there is clearly room for a great GPU library in Rust.
It seems like it's basically a library for program transformation, with more or less two kinds of transformation: parallelization/vectorization and automatic differentiation.
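For the autodiff half, the classic minimal illustration is forward-mode AD with dual numbers — done here via operator overloading rather than the tracing/JIT transformation the library performs, but it shows the core idea of turning a program into one that also computes its own derivative. The `Dual` class and `derivative` helper below are my own illustration, not the library's API:

```python
class Dual:
    """Forward-mode AD value: carries f(x) and f'(x) through every operation."""
    def __init__(self, val, dot=0.0):
        self.val, self.dot = val, dot

    def __add__(self, other):
        other = other if isinstance(other, Dual) else Dual(other)
        return Dual(self.val + other.val, self.dot + other.dot)
    __radd__ = __add__

    def __mul__(self, other):
        other = other if isinstance(other, Dual) else Dual(other)
        # product rule: (fg)' = f'g + fg'
        return Dual(self.val * other.val,
                    self.dot * other.val + self.val * other.dot)
    __rmul__ = __mul__

def derivative(f, x):
    # Seed the input with dot=1.0 and read the derivative off the output.
    return f(Dual(x, 1.0)).dot

# f(x) = x*x + 3*x  =>  f'(x) = 2x + 3, so f'(2) = 7
print(derivative(lambda x: x * x + 3 * x, 2.0))  # 7.0
```

The same user-visible program runs twice the arithmetic under the hood — which is exactly why pairing AD with a vectorizing/fusing backend is attractive.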
The question I would have is: suppose you have some large program that gets transformed into a system that pipes vectors from location to location, doing operations along the way. How do you deal with issues of data/memory divergence [1]? Basically, how do you tune the system so that the data you are using stays in cache? Without considering the cache, many of the advantages of a GPU can be lost. These messy issues tend to appear whenever one engages in code generation. Purely piecewise (elementwise) vectorizations don't have the problem, but anything where you're partially reducing and such could have it.
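A rough model of why this matters: a chain of elementwise ops can be fused into a single pass, so each intermediate value stays in registers, whereas unfused evaluation materializes every temporary to memory and reads it back. A reduction, by contrast, forces a result to be materialized before the next stage can run. The toy Python below (names and traffic counts in comments are mine, a proxy for what a GPU backend contends with) computes `d = (a + b) * c` both ways:

```python
def unfused(a, b, c):
    # Each op materializes a full temporary. For large n the temporary
    # is evicted from cache before the second loop reads it back.
    t = [x + y for x, y in zip(a, b)]        # read 2n, write n
    return [x * y for x, y in zip(t, c)]     # read 2n, write n  -> 6n traffic

def fused(a, b, c):
    # Single pass: (x + y) never touches memory between the add and the mul.
    return [(x + y) * z for x, y, z in zip(a, b, c)]  # read 3n, write n -> 4n

a, b, c = [1.0, 2.0], [3.0, 4.0], [0.5, 0.5]
assert unfused(a, b, c) == fused(a, b, c) == [2.0, 3.0]
```

Same answer, two-thirds of the memory traffic in the fused case — and on a GPU, where elementwise kernels are almost always bandwidth-bound, that ratio translates fairly directly into runtime. Partial reductions break this because the reduced value is needed by many downstream elements, which is presumably where the tuning questions above get hard.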