The Fool on the Hill: August 2013

Friday, 30 August 2013

Freezer sizing

As I prepare to send the pigs to slaughter, one important issue is how big a freezer to buy.

My estimate of my pigs' live weight, using a well known estimating technique, is about 90 Kg each. Boned out dead weight is likely to be about 66% of that, so say 120 Kg. Of course I shan't keep all that, I'll sell some and give some away; and I shan't freeze all I keep. But given that I won't know how much I'll sell until I sell it, I need to reckon on being able to store 100 Kg of frozen meat.

Flesh just about floats in water; it's within 10% the same density. So a kilogram of meat is about a litre of meat, give or take not very much. But, the packing density will not be perfect; there will inevitably be gaps. Let's say between fudge factors and packing density, I'll need to have space for at least 120 litres.

This year.

Aye, there's the rub. Steerpike - my one steer calf - comes ready for slaughter in the winter of 2014-2015, which is to say just over a year's time. As a 'non-short' Dexter, he's likely by then to have a live weight of at least 250Kg and a boned-out weight of at least 150Kg. Assuming the freezer is empty, I'll need at least 150 litres for him.

So I think I'm looking for a 200 litre freezer.

Sunday, 25 August 2013

Reference counting, and the garbage collection of equal sized objects

Yes, I'm still banging on about ideas provoked by the Wing, White and Singer paper, as presented at Mostly Functional. Brief summary: on Friday I posted an essay on the possible use of cons-space as an optimisation for the JVM memory allocator, for use specifically with functional languages (I'm thinking mainly of Clojure, but it has implications for other things such as Armed Bear Common Lisp, and possibly also Scala). Yesterday, I wrote about maintaining a separate heap for immutable objects, as another optimisation. Today, I'm going to write about reference counting, and where it fits in the mix.

The HotSpot JVM uses a tunable generational garbage collector, as detailed here. Generational garbage collectors are a development of the classic mark-and-sweep garbage collector that was used in early LISP implementations from about 1962.

Mark and Sweep

A mark and sweep garbage collector is triggered when allocatable memory is almost exhausted. When it is triggered, execution of the user program is halted. The garbage collector then

Mark phase:

iterate over every object in the heap, clearing the 'mark bit' in the header of each;
Set the mark bit in the 'root' object;
repeatedly do
for each marked object in the heap,
for each pointer in the object,
set the mark bit in the header of the pointed-to object;
end for each
end for each
until no further objects are marked

Sweep phase:

iterate over objects in the heap again, as follows
for each object,
if there is 'free space' (i.e., objects which were not marked as pointed to in the preceding step) 'below' the object in the heap, then
copy the object as low in free space as is possible
iterate through every object in the whole heap fixing up every pointer which pointed to the object in its old location, to point to its new location.
end if
end for each

Finally, the user program is restarted. Needless to say, all this is expensive in time, and leads to noticable pauses in the running program.

Generational

The 'generational' garbage collector optimises this by observing that in most programs, the majority of objects are short lived, and that therefore it is younger objects which should most aggressively be garbage collected. The heap is divided into (at least two) segments, an 'old generation' segment and a 'young generation' segment. A generation counter is added to the header of each object, initialised to zero when the object is instantiated.

Then, when allocatable memory is almost exhausted, normally only the 'young generation' segment is marked and swept. Each time an object survives garbage collection in the 'young generation' segment, its generation counter is incremented, and when it hits the generation-max value, it is copied from the 'young generation' segment into the 'old generation' segment. However, obviously, when any object is moved, either by the sweep operation in the 'young generation' space or by promotion into the 'old generation' space, the entire heap needs to be checked for pointers which need to be updated.

Finally, if, when doing a promotion, it is found that 'old generation' space is almost exhausted, a full mark and sweep operation is performed on old generation space.

Although this sounds (and is) considerably more complicated than the naive mark and sweep algorithm, the 'new generation' garbage collection operations tend to be relatively quick, leading to less frequent noticeable pauses.

Mark and sweep interacts horribly with paged virtual memory systems but it has to be said that generational isn't a whole lot better here, since there will still be a need repeatedly to access every page on which any part of the heap, no matter how rarely used, is stored.

An aside: the look-aside table

One possible means of avoiding the need to iterate over every pointer in the heap each time an object is moved is to have an indirection table, or look-aside table. This is simply an array of pointers, one pointer for every possible object in the heap. Now 'pointers' within user-program objects are simply indices into the location in the indirection table where the actual pointer is stored. Now, when an object is moved during GC, only the indirection table needs to be updated.

That sounds very appealing and efficient; it clearly saves a very great deal of time. Unfortunately it's inefficient and clumsy in its use of memory. The indirection table once allocated cannot easily be extended, and once entries in the table are exhausted, no more objects can be created even if there it still plenty of free memory left in the heap. Finally, every single object access requires an additional memory lookup. All these things together mean that the look-aside table isn't as much of a win as it seems, and is not often used.

Reference counting

In a reference counting garbage collector, every object header contains a reference counter, initially set to one (since when the object is created, something must point to it). Whenever a new pointer is created to an object, the reference counter on the object is incremented. Whenever something that points to the object is removed from the system, the reference counter on the object is decremented. when the reference counter is decremented to zero, the object is removed from the system, and the reference counters of any objects it pointed to are in their turn decremented.

Let's walk a little over what that means. Typically, reference counting is used in systems which have uniform-sized objects. At system initialisation time, memory is essentially an array of 'empty' objects. Each of these objects is initialised with a header which marks it as being an empty object, with a reference count value of zero. A register or global variable, the 'free list pointer', points to the first of these objects; each object in turn points to the next object.

In use, when a memory object is allocated, it is popped off the front of the free list; when a memory object is deallocated, it is pushed back on the front of the free list. Because all the objects are equal sized, any object can be initialised in the space left by any other, so there's never any need to move things. And if memory allocated to the user program becomes exhausted, provided the operating system can allocate more memory it can be initialised and added to the end of the free list at any point - it does not have to be contiguous with the existing memory allocation.

So, lots and lots of win. There's never any significant pause for garbage collection. Nothing ever has to be moved, so there's no problem with fixing up pointers. Why doesn't every system do it this way?

Because, sadly, in a normal useful user program there's a need for unequal sized objects. Yes, strings can be represented as lists of characters; raster images can represented as lists of lists of bits. But this is hopelessly inefficient. So it's much better to maintain a heap for variable sized data. Of course, the pages of equal sized objects - 'cons space' - can themselves float in the heap, so there's no ideological problem with having variable sized data in a reference counter system.

Typically, for each 'heap space object', you'll have a proxy pointer object in cons space. Other things which reference the heap space object will actually hold a pointer to its proxy, and the proxy will have the usual header fields of a cons-space object including the reference counter. The type field in its header will indicate that it is a proxy for the heap space object, and a pointer in its body will point to the actual heap space object.

Problems

There are still a few problems with this solution, most of which affect long-running programs. The first is, no matter how many bits you allocate to the reference counter, there is a maximum value it can store. What happens when an object has more references to it than its reference counter can store? Well, obviously, it can't be incremented further, because it would wrap around and you'd end with a mess. But, equally, it can't be decremented - because you don't know how many times to decrement. So once an object has reached the maximum reference value, it can never be removed from the system by the normal operation of reference counting memory manager.

More subtly, circular data structures can never be removed from the system even if nothing outside the circle any longer references it, since each element holds a pointer to the next and none can ever be decremented to zero. This, however, can't happen in a pure functional language with immutable data objects, since circular data structures are then impossible to create.

Finally, while the user program will not have to be paused for mark-and-sweep, occasionally the deletion of an object which serves as the root of a deep tree of objects will cause a cascade of further deletions, which may also cause a noticeable pause.

Heap space, of course, will fragment, as heap space of any variable-size-object system always does. But heap space can be compacted using a very infrequent mark-and-sweep, and, in any case, this problem isn't special to reference counting systems.

Conclusion

In summary, especially for systems with very large numbers of equal-sized objects such as is typical of programs written in LISP-like languages, reference counting garbage collectors have always struck me as having many benefits. Adding reference counting to any Java Virtual Machine would be decidedly non-trivial, however, and, in particular, using proxy objects in cons-space to point to heap space objects might (I don't yet know) break compatibility with existing compiled Java programs. Also, while a reference counting system may have fewer noticeable pauses, its overall efficiency is not necessarily better than a generational system. It's more 'worth a try' than 'a certain win'.

Saturday, 24 August 2013

The immutable pool: more on optimising memory management for functional languages

Further to yesterday's note on optimising the Java Runtime Environment's memory allocator for the code generated by functional language compilers, I've been reading up on the memory allocator in the OpenJDK Java platform implementation.

First, a note about nomenclature. To my mind the 'Java Virtual Machine' is simply a processor which processes instruction codes - as it were, something in the same category as an ARM 6 or an Intel 80486, except implemented in software. To me it's clear that memory management is not 'part of' that device, it's a low level library expected to be available as part of the runtime environment. However, the writers of the HotSpot VM documentation don't see it that way. To them, the memory allocator is part of the virtual machine, not part of the runtime environment, and as I'm playing with their ball I shall try in what follows to stick to their preferred nomenclature.

The HotSpot memory allocator operates a per-thread generational garbage collector, which depends on

... the second part of the weak generational hypothesis: that there are few references from old objects to young objects.

The generational garbage collector is in fact much more complex precisely because there are some references from an older object to younger objects. The collector uses two pools of memory, a 'young generation' pool and an 'old generation' (or 'tenured') pool (actually it's considerably more subtle than that, but that's enough detail for the present discussion). Churn and froth are expected in the young generation pool, so it is garbage collected regularly. An object which survives more than a short while in the young generation pool is promoted or scavenged into the old generation pool by copying it; and when this is done, it is necessary to scan through the whole old generation pool (as well as the young generation pool) to fix up any pointers which pointed to its old location in the young generation pool so that they now point to its new location in the old generation pool. That scan is inevitably costly: it must visit every single pointer, and compare its value to the value to be changed.

So far so good. But, how come there are any pointers in the old generation pool which point to newly created objects in the young generation pool? There are, because objects in Java are mutable. We can do the equivalent of RPLACA and RPLACD operations - we can destructively overwrite pointers - and in fact the Java imperative object oriented paradigm encourages us to do so.

You can write classes of immutable objects in Java; if you declared each instance variable of a class (and each instance variable of every class from which it inherits) to be final, then every object of that class would be immutable. I've never seen it done. But in pure functional languages, all data items are immutable. You cannot overwrite pointers.

Of course, for many purposes holding state information is pretty vital, or at least it's hard for us, educated as we have been in imperative languages, not to see it as such. So Clojure, for example, while holding to the mantra that data is immutable, has a special reserved area of Software Transactional Memory in which mutable data may be stored, and also handles things like I/O by invoking Java classes expected to be present in the environment which do use mutable data. Nevertheless, programs compiled with the Clojure compiler can be expected to generate a very high proportion of objects with are immutable.

So the question arises, at what density of immutable objects does the idea of having an additional pool for 'old generation immutable' objects become a win? Remember that an older, immutable object cannot hold a pointer to a younger object (whether mutable or not), because the younger object did not exist when the older object was created. So immutable 'old generation' objects do not need to be scanned and 'fixed up' when a surviving 'young generation' object is scavenged into the old generation pool. Given that the scan operation is expensive, it does not seem to me that the density of immutable objects would need to be very high before this was a win.

For clarity, the algorithm for the revised 'scavenger' part of the garbage collector could be as follows:

for each object in the 'young generation' pool
if it has survived long enough to be worthy of scavenging (no change to that part of the algorithm) then
if mutable flag set, then
copy into the 'old generation mutable' pool (i.e. no change to what happens now).
else
copy into the 'old generation immutable' pool
end if
scan remaining objects in the 'young generation' pool and fix up pointers (since objects in the young generation pool, even if immutable, may reference the object just promoted - again, no change to what happens now)
scan all objects in the 'old generation mutable' pool and fix up pointers.
end if
end loop

There's no need to scan anything in the 'old generation immutable' pool, because it cannot hold pointers to the promoted object. For the same reason there's no need to scan the 'old generation immutable' pool during the mark phase of a generational mark-and-sweep operation, which necessarily happens before each group of salvage operations.

So the tradeoff is one 'check flag and branch' operation for each object promoted, against scanning the whole old generation space including potentially many immutable objects. I'd guess that by 10% immutable objects you'd have a performance win, and that for higher immutable object densities you could have an even more important win.

The fly in the ointment is that you need a flag bit on the header of every object. The OpenJDK object header currently contains a (by implication 32 bit) 'mark word', which is used precisely for flags used by the memory allocation system, but I haven't yet found where those flags are documented or whether any are free to be used.

Finally, this optimisation - while both potentially bigger and easier to implement than the cons-space optimisation I suggested yesterday - is independent of it. Both optimisations together potentially offer a still bigger win.

Friday, 23 August 2013

Functional languages, memory management, and modern language runtimes

At the Mostly Functional workshop at the Turing Festival in Edinburgh yesterday, I attended a very interesting presentation on the performance of compilers of functional languages for the JVM by Wing Hang Li and Jeremy Singer. The whole talk was interesting, but some of the graphs were electrifying.

What Wing had done was analyse the emitted code size of methods emitted by the different compilers (Scala, Clojure, Jython and JRuby). These showed that the code emitted by these compilers was systematically different - very different - from code emitted by the Java compiler. Typically, they generated smaller code units ('methods', in the terminology of the presentation) - this especially true of Scala - and made much more use of stack. But of more interest to me was this:

"We examine the distribution of object sizes, weighted by their dynamic allocation frequency. The size of java.lang.Object is 16 bytes for the JVM we use. The boxplots in Figure 6 show that the object size for most non-Java JVM languages is dominated by only one or two sizes. This can be seen from the median object size in the unfiltered JRuby and Jython boxplots and the filtered Clojure and Scala boxplots. However, the median object size for Java varies between 24 to 48 bytes. By comparing the unfiltered and filtered boxplots, we see that Clojure and Scala use smaller objects more frequently than Java."

This is kind of what one would expect. Functional languages should (hopefully) encourage programmers to use smaller units of code than imperative languages; and, because functional programming paradigms make much more use of recursion than typical imperative paradigms, you'd expect to see more use of stack. But more significantly, a great deal of the memory allocation is likely to be small fixed size objects (CONS cells, and other things like e.g. ints and doubles which will fit into the memory footprint of a CONS cell), and that furthermore these small objects are likely to be allocated and deallocated much more frequently than larger objects. And this takes me right back into ideas about the design of LISP runtimes that I was interested in twenty five years ago.

Given this pattern of a rapid churn of small fixed-size objects a naive heap allocator will tend to fragment the heap, as small but still-live objects become sparsely scattered through heap space, ultimately requiring a full mark-and-sweep before larger objects can be allocated. Now I'm certain that the Java heap allocator is anything but naive, but it's unlikely to be optimised for large numbers of rapidly allocated and deallocated equal sized objects.

Some earlier LISPs divided memory into 'cons space' and 'heap space'. 'Cons space' was essentially a set of pages themselves allocated within heap space, each of which contained an array of cons cells. When a cons-space page was allocated, each of its cells would be linked together onto the free list. When a cons cell was allocated, it would be popped off the freelist and the freelist head pointer updated from its CDR; when a cons cell was deallocated, it was simply pushed back onto the freelist and the freelist head pointer updated to point to it. When cons space was exhausted, a new page was allocated from the heap. This strategy works with both mark-and-sweep and reference-counting strategies, although I'm most familiar with it in the reference-counting context.

This makes the allocation of objects of size equal to or smaller than a cons cell extremely cheap and fast, and avoids heap fragmentation. A cons cell comprises two words the width of the address bus, plus a header containing e.g. type flags, GC flag and reference count; a number of other data objects such as, e.g., Integers, Doubles and other boxed primitives, easily fit within this footprint.

Wing and Singer's enquiry, as I understand it, is whether special tuning of the JIT could improve performance of the Java Virtual Machine for non-Java languages. Brief summary, the 'JIT' (Just in time) compiler is an element of the Java Virtual Machine implementation for a particular concrete processor, which translates fragments of JVM object code into optimised object code for the concrete processor. The JIT is part of the Java Runtime Environment (JRE), not of the Java compiler, because this detail tuning of the object code happens at runtime for specific patterns in the code being executed on the specific concrete processor. Because the code fragments emitted by the different functional-language compilers are systematically different from those emitted by the Java compiler, there may be merit in this special tuning.

But, since the Java Runtime Environment comprises not just the JVM but also other components including, critically, the memory management subsystem, it occurs to me that given this very different memory usage pattern, a custom memory manager - implementing specifically a separate cons-space allocated as pages in the heap, using reference counts and a free-list - might well be an even bigger win. Furthermore, unlike the JIT tuning suggested by Wing and Singer, the memory manager tuning would be portable between different concrete processor families.

Wing and Singer's proposed change to the JIT would not prevent the JRE running any arbitrary JVM program, nor should any arbitrary JVM program run less efficiently than on a 'vanilla flavour' JRE. Neither (I hope) would my proposed change to the memory manager. Of course, the adapted JRE would be somewhat larger; you could describe this as code bloat. Of course, if all you want to run on your JVM is Java code, this adapted JRE would be of no benefit.

All this applies, of course, not only to the Java Runtime Environment. It almost certainly applies equally to the Common Language Runtime used in the .Net environment, to the runtime environment Erlang virtual machine (targeted by for example Joxa), and probably others.

However, it would not be trivial to retrofit this. The Clojure cons cell is a POJO ('plain old Java object'), allocated and deallocated by standard Java mechanisms. Joxa on the Erlang VM is similar, and I should be surprised if ClojureCLR on the .Net Common Language Runtime is much different. As any arbitrary object may hold pointers to cons cells, attempting to do special memory management for cons cells would require considerable rewriting of the memory management subsystems, at the runtime environment level. I say again, I do not believe the memory manager in the JRE is by any means naive. It is very highly tuned code written by very able engineers and tested widely across the world over a number of years.

Even supposing - as Wing, White and Singer's paper does suggest - that the Java Runtime Environment is not optimally tuned to the code patterns and object patterns of functional languages, it doesn't necessarily follow that changing it would improve things. But, the OpenJDK is just that - open. It would be possible to clone it and experiment. It would be possible to produce a variant JRE, either with specific tuning for specific things as I've described, or designed to allow modular selection and replacement of JIT and memory manager components at run time, and, if benchmarking proved the variant implementation more efficient at executing functional language programs while no less efficient at executing Java language programs, it might be worth contributing the changes back.

The fool on the hill by Simon Brooke is licensed under a Creative Commons Attribution-ShareAlike 3.0 Unported License