Suppose that the program only needs about 1 MB of objects, but it will fill up 100 MB before performing a copying GC. In mark and sweep, you need to perform an operation for every garbage object (of which there are 99 MB), whereas with copying GC, you need only perform operations for every live object (of which there is 1 MB). Even if copying an object takes ten times as long as returning a chunk of memory to a free-list, the copying algorithm is faster. (Also, popping a free cell off a free-list is more expensive than incrementing an allocation pointer.) Of course, either way, the total work done for allocating objects and freeing them is O([number of objects allocated]).
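The arithmetic here can be spelled out with a toy cost model. All the unit costs below are illustrative assumptions (the "ten times as long" figure from the comment, a made-up 16-byte object size), not measurements:

```python
# Hypothetical cost model for the 1 MB live / 100 MB heap scenario.
OBJ_SIZE = 16                                  # bytes per object (assumed)
live_objs = 1 * 1024 * 1024 // OBJ_SIZE        # ~1 MB of live objects
dead_objs = 99 * 1024 * 1024 // OBJ_SIZE       # ~99 MB of garbage

COPY_COST = 10   # copying one live object (assumed 10x a free-list push)
FREE_COST = 1    # returning one dead object to a free-list

mark_sweep_work = FREE_COST * dead_objs   # sweep touches every dead object
copying_work = COPY_COST * live_objs      # copying touches only live objects

print(copying_work < mark_sweep_work)     # True, by a factor of ~10
```

Even with the 10x per-object handicap, the copying collector does about an order of magnitude less total work in this scenario, because its cost scales with live data rather than with the whole heap.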
Basically, instead of marking live data, then sweeping all unmarked data, you just mark live data, then the next time a slot is requested, you step the allocator over the heap until it finds the next unmarked cell and re-use that. If most cells are dead, this happens very quickly. If you have to sweep over too many live cells (left deliberately vague), you grow the heap.
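A minimal sketch of that allocation loop, assuming equally-sized cells and a mark array set by a prior marking phase (the `LazyHeap` name and layout are made up for illustration):

```python
# Toy lazy-sweep allocator: the "sweep" happens incrementally, inside
# allocation, by stepping over marked (live) cells.
class LazyHeap:
    def __init__(self, n_cells):
        self.marked = [False] * n_cells   # set by the marking phase
        self.scan = 0                     # allocation pointer

    def alloc(self):
        while self.scan < len(self.marked):
            i = self.scan
            self.scan += 1
            if self.marked[i]:
                # Live cell: skip it, clearing the bit so the next
                # mark phase starts from a clean slate.
                self.marked[i] = False
            else:
                return i   # garbage from the last cycle: reuse it
        return None        # swept the whole heap: re-mark, or grow

h = LazyHeap(8)
h.marked[0] = h.marked[1] = h.marked[3] = True   # pretend these are live
print(h.alloc())   # 2 -- stepped over two live cells
print(h.alloc())   # 4
```

If most cells are dead, each `alloc` returns almost immediately; the "sweep over too many live cells" case is the loop spinning past long runs of marked cells, which is when you'd grow the heap instead.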
Interesting... I still feel that this is for the most part merely deferring the freeing work to when you next allocate an object. It may succeed in getting the "free" cost down to a single instruction--checking a mark bit--but it is still a cost per dead object that a copying GC wouldn't have to pay. (The difference may be, as some say, "academic".) It does have the slight advantage of reducing the maximum pause time due to sweeping down to being proportional to live objects rather than garbage.
Also, what if you want to allocate a large object? You might have to skip over some unmarked cells. Either you'd waste the memory, or you would probably want to put them on a free-list of some sort, in which case you are doing more than just bumping a pointer and checking for a mark. ... I see, it says "equally-sized chunks of memory".
Still, that's a cool idea. As mark-and-sweep goes, anyway. :-} I'm rather more fond of copying garbage collectors--I like the determinism, the fact that (normally) the memory address of each object after a full copying GC will depend only on the structure of the graph of objects, and not on the memory addresses of the objects prior to the GC. (In other words, it destroys the information "what the memory address of each object was before this GC". This implies a somewhat higher inherent cost, to eliminate this entropy. See [1] for musings about the relation of this to thermodynamics.) In particular, whether you have fragmentation issues won't depend on the past history of your program. (You could say that the space lost to fragmentation is fixed at a factor of 2.)
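The determinism point can be seen in a toy Cheney-style breadth-first copy: each object's new address is just its position in the traversal of the object graph, so two heaps with the same graph shape end up with the same layout regardless of old addresses. (Object representation here is a throwaway dict, not any real runtime's layout.)

```python
# Toy Cheney-style copying collection over a graph of dicts.
def evacuate(obj, to_space, forwarded):
    if id(obj) in forwarded:              # already copied: forwarding address
        return forwarded[id(obj)]
    addr = len(to_space)                  # bump-allocate in to-space
    forwarded[id(obj)] = addr
    to_space.append({"name": obj["name"], "children": list(obj["children"])})
    return addr

def copy_collect(roots):
    to_space, forwarded = [], {}
    for r in roots:
        evacuate(r, to_space, forwarded)
    scan = 0
    while scan < len(to_space):           # Cheney scan: to-space is the queue
        obj = to_space[scan]
        obj["children"] = [evacuate(c, to_space, forwarded)
                           for c in obj["children"]]
        scan += 1
    return to_space

a = {"name": "a", "children": []}
b = {"name": "b", "children": [a]}
root = {"name": "root", "children": [b, a]}
heap = copy_collect([root])
print([o["name"] for o in heap])   # ['root', 'b', 'a'] -- breadth-first order
```

Run it with the same graph built in any allocation order and the to-space layout comes out identical: the old addresses (here, `id()`s) are destroyed, which is exactly the entropy-elimination point above.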
Checking the mark bit can be quite cheap, though, particularly if you store the mark bits in their own array - the next several bits will be in-cache. Still, the overall performance is likely to vary quite a bit for either scheme depending on the project's other quirks. Mark-sweep has the advantage that it doesn't need to move objects, but copying is easier with variable-size allocations, etc.
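The separate-array idea is just a dense bitmap indexed by cell number; this is a minimal sketch (the `MarkBitmap` name is made up):

```python
# Out-of-line mark bitmap: one bit per cell, packed 8 per byte.
# A single 64-byte cache line then covers the marks for 512 cells,
# which is why sequential sweeping over it is cache-friendly.
class MarkBitmap:
    def __init__(self, n_cells):
        self.bits = bytearray((n_cells + 7) // 8)

    def set(self, i):
        self.bits[i >> 3] |= 1 << (i & 7)

    def test(self, i):
        return bool(self.bits[i >> 3] & (1 << (i & 7)))

bm = MarkBitmap(100)
bm.set(5)
print(bm.test(5), bm.test(6))   # True False
```

Compared with a mark bit in each object's header, this also means marking never dirties the objects themselves, which matters for copy-on-write pages.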
Another disadvantage with lazy sweeping is that I don't see a clear way to make it generational. (Right?)
Mark each arena with the last time something was collected from it and just skip sweeping old pages until X cycles have passed? Just thinking out loud... that check wouldn't save time for pages that are only half-filled with old objects.
The problem with applying Appel's argument in practice is that it only shows that copying garbage collection is better if you are willing to waste RAM. In practice, if you had a 1 MB working set and 100 MB of RAM, you would probably want to use most of the extra 99 MB for caching even slower levels of your storage hierarchy.
Moving garbage collection does have some other potential benefits; it can provide guarantees regarding fragmentation, and it can improve spatial locality of data.
This is why generational GCs are a good idea. You can afford to waste RAM (proportionally) for early generations, because they're small. And for many use cases, this works out great; e.g. in server workloads, most of the allocations made during handling a request will die after the response has been sent. You can get great performance from having sufficient "waste" RAM for (say) 80th percentile total request allocation size times 80th percentile concurrent requests; but this is not normally a huge chunk of total heap size (which will usually be caches of various different kinds, which themselves benefit from a more manual deallocation strategy tied into cache expiration).
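As a back-of-envelope version of that sizing argument (every number below is a hypothetical workload figure, purely for illustration):

```python
# Hypothetical server workload: size the nursery so that the 80th
# percentile of concurrent in-flight request allocations fits in it.
p80_request_alloc = 256 * 1024     # 80th pct allocation per request (assumed)
p80_concurrency = 40               # 80th pct concurrent requests (assumed)
total_heap = 4 * 1024**3           # e.g. a 4 GB heap, mostly caches

nursery = p80_request_alloc * p80_concurrency   # 10 MB
print(nursery, nursery / total_heap)            # a fraction of a percent
```

Under these made-up numbers the "wasted" copying-GC headroom is ~10 MB on a 4 GB heap, i.e. well under 1% -- which is the sense in which generations let you afford Appel-style waste proportionally.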
Something like this is described in detail here: "Garbage Collection Can Be Faster Than Stack Allocation", Andrew Appel: http://www.cs.princeton.edu/~appel/papers/45.ps