I saw their talk at Siggraph in 2012. This is really interesting work.
The target image processing algorithms are low level. Think "How can I write the fastest 3x3 box blur for Sandy Bridge and later architectures", not "How can I speed up my face detector".
Examples of scheduling decisions that Halide deals with:
- process the image in 4x4, 8x8, or 16x16 chunks?
- use a temporary buffer for an intermediate processing stage, accepting worse cache behavior in exchange for reusing results that are needed more than once?
- use a sliding window the size of a couple of rows as a compromise?
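Those trade-offs can be sketched in plain Python (this is illustrative code, not Halide; the pipeline and function names are made up). For a two-stage 1D pipeline, the three strategies compute identical results and differ only in intermediate storage and redundant work:

```python
# Illustrative sketch: three ways to schedule a two-stage 1D pipeline,
# where stage1 is a neighborhood sum and stage2 sums stage1 values.

def stage1(a, i):
    return a[i] + a[i + 1]

def full_buffer(a):
    # Materialize all of stage1 first: no recomputation, but the
    # temporary buffer may fall out of cache for large inputs.
    tmp = [stage1(a, i) for i in range(len(a) - 1)]
    return [tmp[i] + tmp[i + 1] for i in range(len(tmp) - 1)]

def fully_inlined(a):
    # Recompute stage1 on demand inside stage2: no buffer at all,
    # but each interior stage1 value is computed twice.
    return [stage1(a, i) + stage1(a, i + 1) for i in range(len(a) - 2)]

def sliding_window(a):
    # Compromise: keep only the most recent stage1 value around.
    out = []
    prev = stage1(a, 0)
    for i in range(1, len(a) - 1):
        cur = stage1(a, i)
        out.append(prev + cur)
        prev = cur
    return out

data = [3, 1, 4, 1, 5, 9, 2, 6]
assert full_buffer(data) == fully_inlined(data) == sliding_window(data)
```

The point Halide makes is that which of these wins depends on the machine, so the choice should be a one-line schedule change rather than a rewrite of the loop nest.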
This kind of work is the difference between a laggy and an interactive Gaussian Blur radius slider in Photoshop.
The Halide code shown at the talk replaced really hard-to-read code full of SIMD compiler intrinsics: dozens of lines doing something that would be five lines in a naive implementation. With Halide, it's almost as readable as the naive version because the scheduling is separated from the algorithm.
For an application like Photoshop, this is a big win because they will never choose code readability over performance. Performance wins every time. If they can get the same performance with readable code, they are very happy.
GPU code generation falls naturally out of the scheduling intelligence and restricted problem domain.
I have never used Halide, so I do not intend to endorse it, but this line of inquiry is absolutely useful for a certain niche of programming.
As a computer vision researcher, I find this very interesting, although I have yet to understand how they intend to generalize highly complicated optimization patterns (access order, locality, pipelines, platform limitations, ...). Some algorithms (beyond the blur filter shown) require quite complicated access patterns on the image data and can usually only be optimized by hand. That doesn't imply they wouldn't benefit from general optimization at all, just that they might be much faster when hand optimized. Still, if Halide produces faster code in some cases (e.g. filter operations, among others), it will be worth its salt.
I went to one of the SIGGRAPH talks they did a couple years ago.
The theoretical plan is basically to write an optimizer that can intuit good schedules, while recognizing that they don't yet have a good idea of how to do this. I think they can currently run some ML algorithms that churn through a bunch of different schedules and find the one best suited to a problem, but it's rather brute-force and slow at the moment.
That said, the conceptual distinction of separating the algorithm from the scheduling also lets you tune the schedule by hand much more easily than would otherwise be possible.
Since this is a language and a compiler, my guess would be the answer to your question is: the compiler will optimize for the underlying platform. The whole point of Halide is stated in their abstract: "... make it easier to write high-performance image processing code" which is the exact opposite of "hand optimization". Halide allows developers to express what to do in a powerful, domain specific language - the compiler takes care of the "how".
This approach makes a lot of sense: abstract away the annoying low-level architecture details. They have a lot of targets, which is fantastic: x86/SSE, ARM v7/NEON, CUDA, Native Client, and OpenCL. Let the architecture specialist worry about the architecture specifics. The disadvantage: the achieved performance then depends on the quality and wisdom of the compiler. But once certain things are optimized for a specific architecture, every user will benefit.
How they do it on the compiler end of things, I'm not sure. There are a number of techniques. Among the simpler is auto-tuning. There is also a new term: "copious-parallelism" [0]. It acknowledges that to achieve performance portability across platforms, algorithms must offer explicit ways of parametrization and tuning to adapt to different platforms. I think this is the right concept but believe that it could be implemented within the compiler. The domain specialist should not have to think about those things.
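Auto-tuning in its simplest form can be sketched in a few lines of plain Python (illustrative only; this is not how Halide's tuner is implemented): time each candidate schedule parameter on the real workload and keep the fastest. Here the "schedule" is just the tile size of a blocked 2D traversal.

```python
import time

def run_tiled(grid, tile):
    # Process the grid in tile x tile blocks. The work itself is a
    # stand-in; only the traversal order depends on the tile size.
    n = len(grid)
    total = 0
    for by in range(0, n, tile):
        for bx in range(0, n, tile):
            for y in range(by, min(by + tile, n)):
                for x in range(bx, min(bx + tile, n)):
                    total += grid[y][x] * 2
    return total

def autotune(grid, candidates):
    # Measure each candidate and return the fastest one.
    timings = {}
    for tile in candidates:
        t0 = time.perf_counter()
        run_tiled(grid, tile)
        timings[tile] = time.perf_counter() - t0
    return min(timings, key=timings.get)

grid = [[(x * y) % 7 for x in range(128)] for y in range(128)]
best = autotune(grid, [4, 8, 16, 32, 64])
# Every candidate computes the same answer; only the speed differs.
assert all(run_tiled(grid, t) == run_tiled(grid, best) for t in [4, 8, 16, 32, 64])
```

Real tuners search a vastly larger space (tilings, orderings, fusion decisions) per target platform, which is why the brute-force approach is slow.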
The paper was specifically about writing your schedules by hand, not the compiler doing them automagically, which is a huge pipe dream at any rate. For the programs in the class, you are looking at a few orders of magnitude in perf differences between schedules, which is why the programmer needs control, so they can be guaranteed the performance they are expecting. Compilers only reliably optimize at the few-percentage-point, nice-to-have level.
You're right, I completely misunderstood the purpose of Halide. I read up on it and I see now that they simplify how developers write the schedule by hand. The schedule must be specified by the developer; the compiler doesn't do that.
It looks like they indeed also had a DSL for image processing. Despite what the Halide presentation may be emphasizing, its novelty is not so much in the DSL as in the superoptimizer that comes with it, though.
Now, superoptimization is not a particularly new topic either (see e.g. [1]), but Halide demonstrated that improvements can be significant in this domain. Far more impressive than, e.g., superoptimization applied to generic assembly code (hasn't really worked so far), query optimization, etc.
I might be a stickler here, but this strikes me as a good example of a "domain-specific language" rather than of a general programming language.
Now, it's certainly possible to bolt general-purpose programming features onto a DSL (e.g. Matlab or R), but the approach (also used by Halide) of embedding the DSL into a general-purpose language (here: C++) is, in my eyes, vastly better suited to scaling up from experimentation code to something that can become part of a larger system (think OpenCV in robotics, etc.).
They don't advertise it as a general programming language. At the end of the video they specifically state it only works for optimizing algorithms that deal with images. For general structures like trees, they have no application.
The general domain of applicability is data parallel processing. There is very little that is image specific in Halide. So audio, dense linear algebra, some finite element applications, etc. would also be well within the domain.
We evaluated this some time ago at the company I used to work for, but it supports neither the Intel compiler nor the MS compiler, so it was an instant no-go. Anyway, it is an interesting idea.
You should be able to link generated Halide code(1) with either compiler. (Using Halide generated code does not impose much at all in the way of requirements other than C-level ABI compatibility and basic library support.)
A fair bit of work has recently been done on making Halide work well on the Windows platform. We are erring toward requiring newer versions of Visual Studio, however. (E.g. there are advantages in being able to rely on C++11 support for generating Halide code.)
I do not see any issues filed on things not working with Intel's compiler. Feel free to file one. Halide is an open source project and both bug reports and patches help a great deal.
(1) Halide can operate as either a Just In Time (JIT) compiler or an Ahead Of Time (AOT) compiler. Generally AOT is much more convenient for deploying in production. The end result of compiling a single filter in Halide is an object file and a header file declaring a single C-linkage function with all of its parameters passed in on the call. The minimal runtime is included in the object file, and dependencies on libraries are also quite minimal: things like libc and pthreads. Functionality provided by the runtime can easily be overridden by providing one's own definitions for various functions, as weak linkage is used.
There is also a set of Python bindings, which can be used to write Halide code which is then AOT compiled. The AOT compiled object file can be linked into pretty much anything that can handle C functions either via direct linkage or a foreign function interface.
Can you combine this with AVISynth for rendering videos? It would be cool to be able to completely bypass Lightroom when I need to process individual images and do it instead in a single script.
One wonders if you could build an OpenCV interface on top of Halide, maintaining compatibility while also gaining the potential to optimize across OpenCV function calls?
I want to write in OpenCV, but I also want the compiler to fix everything across function boundaries :-)
I cannot shake the feeling that this is just yet another iteration of the usual concept.
If you look at Google Renderscript, Microsoft C++AMP, NV CUDA, OpenCL and whatnot, you'll realize that they all do exactly the same thing, the difference being only at the syntax level (with a few caveats, but those are relatively minor). Some are more cumbersome to use than others.
Halide looks neat syntax-wise, but it doesn't seem to make the actual algorithms any easier to write. You still have to do the same mental work at the same level as you do with OpenCL.
All of these are to each other what Fortran, Pascal and C are to each other. Same basic idea in a different package. I'm waiting for the first system that is equivalent of C++. Something that really brings new concepts to the table instead of just a different syntax.
This comment completely misses the fundamental innovation in Halide, which is a separation of "algorithm," a technical term denoting the part of the program that defines the computation, and "schedule," the part of the program which defines the mapping to hardware resources.
The end result is not like OpenCL, CUDA/PTX, or RenderScript. (All of which we regard as suitable target languages for Halide. PTX (CUDA's low-level assembly language) and OpenCL are currently supported.) As a concrete example, there is no explicit memory allocation in Halide and loops are often implicit. The overall character of Halide is that of a functional programming language.
As another example, segmentation violations due to errors in Halide code are impossible, barring bugs in the compiler implementation or errors in user-provided C++ code, which Halide has no control over. Some bounds errors are reported at compile time, and the rest turn into (efficient) assertion checks at runtime.
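The idea behind those compile-time bounds guarantees can be sketched in plain Python (an illustrative sketch, not Halide's real implementation; the function names are made up): because the stencil's footprint is known symbolically, the compiler can derive the input region an output region will touch and check it once up front instead of risking out-of-bounds reads per access.

```python
# Bounds inference sketch for a 1D stencil of a given radius.

def required_input_interval(out_lo, out_hi, radius):
    # A stencil at output x reads input x - radius .. x + radius,
    # so the inclusive output interval [out_lo, out_hi] needs:
    return out_lo - radius, out_hi + radius

def blur1d(inp, out_lo, out_hi, radius=1):
    lo, hi = required_input_interval(out_lo, out_hi, radius)
    # One up-front check replaces per-access segfault risk.
    assert 0 <= lo and hi <= len(inp) - 1, "bounds error caught early"
    k = 2 * radius + 1
    return [sum(inp[x - radius:x + radius + 1]) / k
            for x in range(out_lo, out_hi + 1)]

inp = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
blur1d(inp, 1, 4)  # interior region passes the up-front check
```

Requesting the full range `blur1d(inp, 0, 5)` fails the check immediately rather than reading past the buffer.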
Schedules do get pretty low-level and writing them is difficult. Algorithms are generally a fair bit easier to write and while there is plenty to be done in improving the language, it is already a step up in productivity over e.g. OpenCL or even straight C++ for a lot of tasks within the domain of data parallel processing.
> I'm waiting for the first system that is equivalent of C++.
Not gonna happen anytime soon. C++ is much more than just a programming language, it is an entire "ecosystem". You have toolchains, libraries, drivers, APIs, existing software stacks, all written in C++ and able to interface with C/C++ directly, i.e. no translation layer needed. You're not just gonna replace that with a new language.
Sure, Halide may be seen as just some syntactic sugar (much like quite a few bits and pieces of C++11), but it actually provides you with a different level of abstraction than, say, OpenCL, MPI or OpenMP, where you are very specific about e.g. the level of concurrency (which heavily impacts the design of your algorithm), while Halide tries to almost completely separate algorithm and scheduling.
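That separation can be sketched abstractly in plain Python (illustrative only, not Halide syntax): the algorithm says what each pixel is; the schedule says in what order pixels are computed. Changing the schedule can never change the result, only the performance.

```python
W, H = 6, 4

def algorithm(x, y):
    # Pure per-pixel definition, independent of evaluation order.
    return x * 10 + y

def schedule_row_major():
    out = [[0] * W for _ in range(H)]
    for y in range(H):
        for x in range(W):
            out[y][x] = algorithm(x, y)
    return out

def schedule_tiled(tile=2):
    # Same algorithm, different traversal (assumes W and H are
    # divisible by the tile size, for brevity).
    out = [[0] * W for _ in range(H)]
    for by in range(0, H, tile):
        for bx in range(0, W, tile):
            for y in range(by, by + tile):
                for x in range(bx, bx + tile):
                    out[y][x] = algorithm(x, y)
    return out

assert schedule_row_major() == schedule_tiled()
```

In OpenCL or OpenMP the traversal strategy is baked into the kernel or loop structure itself, which is exactly the coupling Halide removes.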
Maybe I phrased it badly. Skrebbel said it a lot better. It's about providing something that's on a truly higher level.
It's nice that Halide attempts to separate the algorithm and the scheduling; however, as of now that's possible in OpenCL too. The implementations are just so bad at it that it's not really useful.
> I'm waiting for the first system that is equivalent of C++. Something that really brings new concepts to the table instead of just a different syntax.
Obligatory HN reaction: I'd wait it out for the first Lisp to appear.
I haven't yet given it a spin, but it wouldn't be fair to call it just syntactic sugar. The value it provides, besides slightly easier-to-write syntax, is abstracting away kernel-level code. You don't need to specify block/grid sizes or move memory between CPU and GPU, but I doubt it can match the performance of hand-written OpenCL/CUDA kernels when it comes to bleeding-edge imaging performance.
Did you watch the video? Did you read the article?
This is about optimizing memory locality and parallelism. It's not about getting access to the underlying hardware, as with Microsoft C++AMP, NV CUDA, or OpenCL.
I did watch the video. It is a neat idea. However it is something that is already made explicitly possible by OpenCL.
If you write a kernel which uses neither local memory nor the local_id, it is effectively a Halide pipeline stage: the points can be evaluated in arbitrary order (the spec says everything is implementation defined).
If we look at the blur example in the video, an OpenCL implementation is also free to effectively merge the stages as Halide does, because the spec only requires that whatever the previous kernel invocation has written must be visible to the next kernel. Nothing more and nothing less.
Sure, OpenCL allows you to fiddle around with low-level details, but it also allows you to write completely platform-neutral code that the platform is then responsible for optimizing.
I do agree that Halide allows you to easily explore the different scheduling options, that's something OpenCL is not capable of. In OpenCL you either do it manually or leave it as totally defined by the implementation.
"OpenCL implementation is also free to effectively merge the stages like in Halide."
I gave up relying on magic compilers a long time ago. And having worked in this domain for a long time I'm actually offended by people who write off the problem as simply a matter of a good enough optimizer. This has significantly held back both performance and portability in imaging. (And likely other areas.)
Halide is not magic, it is just a better slicing of the problem backed up by a good implementation. As always there is no free lunch but when it comes down to actually shipping this kind of code across a wide variety of platforms with great performance, it is a lot more productive than anything else out there.
I do agree that relying on compiler optimizations is a waste of time. They never seem to appear. I simply wished to point out that you can write Halide-like code in OpenCL as well. And I'd love to see an implementation that would attempt a similar style of optimization to what Halide allows.
Halide is extremely domain specific, which is a good thing: it allows them to focus on the problem at hand, namely making it easy to write image processing filters that can be made performant. However, I would not wish to write a bitonic sort or anything like that in Halide.
As I wrote in another comment, the domain of problems for which Halide works is broader than imaging. I usually present it as "data parallel problems." In fact, I'd say the difference in domain between what Halide is good at and what OpenCL and CUDA are good at is not that significant in practice because those languages are basically C/C++ outside of kernel parallelism. (They are each adding some task parallelism facilities as well.)