Friday 23 August 2013

Functional languages, memory management, and modern language runtimes

At the Mostly Functional workshop at the Turing Festival in Edinburgh yesterday, I attended a very interesting presentation on the performance of compilers of functional languages for the JVM by Wing Hang Li and Jeremy Singer. The whole talk was interesting, but some of the graphs were electrifying.

What Wing had done was analyse the emitted code size of methods emitted by the different compilers (Scala, Clojure, Jython and JRuby). These showed that the code emitted by these compilers was systematically different - very different - from code emitted by the Java compiler. Typically, they generated smaller code units ('methods', in the terminology of the presentation) - this especially true of Scala - and made much more use of stack. But of more interest to me was this:
"We examine the distribution of object sizes, weighted by their dynamic allocation frequency. The size of java.lang.Object is 16 bytes for the JVM we use. The boxplots in Figure 6 show that the object size for most non-Java JVM languages is dominated by only one or two sizes. This can be seen from the median object size in the unfiltered JRuby and Jython boxplots and the filtered Clojure and Scala boxplots. However, the median object size for Java varies between 24 to 48 bytes. By comparing the unfiltered and filtered boxplots, we see that Clojure and Scala use smaller objects more frequently than Java."
This is kind of what one would expect. Functional languages should (hopefully) encourage programmers to use smaller units of code than imperative languages; and, because functional programming paradigms make much more use of recursion than typical imperative paradigms, you'd expect to see more use of stack. But more significantly, a great deal of the memory allocation is likely to be small fixed size objects (CONS cells, and other things like e.g. ints and doubles which will fit into the memory footprint of a CONS cell), and that furthermore these small objects are likely to be allocated and deallocated much more frequently than larger objects. And this takes me right back into ideas about the design of LISP runtimes that I was interested in twenty five years ago.

Given this pattern of a rapid churn of small fixed-size objects a naive heap allocator will tend to fragment the heap, as small but still-live objects become sparsely scattered through heap space, ultimately requiring a full mark-and-sweep before larger objects can be allocated. Now I'm certain that the Java heap allocator is anything but naive, but it's unlikely to be optimised for large numbers of rapidly allocated and deallocated equal sized objects.

Some earlier LISPs divided memory into 'cons space' and 'heap space'. 'Cons space' was essentially a set of pages themselves allocated within heap space, each of which contained an array of cons cells. When a cons-space page was allocated, each of its cells would be linked together onto the free list. When a cons cell was allocated, it would be popped off the freelist and the freelist head pointer updated from its CDR; when a cons cell was deallocated, it was simply pushed back onto the freelist and the freelist head pointer updated to point to it. When cons space was exhausted, a new page was allocated from the heap. This strategy works with both mark-and-sweep and reference-counting strategies, although I'm most familiar with it in the reference-counting context.

This makes the allocation of objects of size equal to or smaller than a cons cell extremely cheap and fast, and avoids heap fragmentation. A cons cell comprises two words the width of the address bus, plus a header containing e.g. type flags, GC flag and reference count; a number of other data objects such as, e.g., Integers, Doubles and other boxed primitives, easily fit within this footprint.

Wing and Singer's enquiry, as I understand it, is whether special tuning of the JIT could improve performance of the Java Virtual Machine for non-Java languages. Brief summary, the 'JIT' (Just in time) compiler is an element of the Java Virtual Machine implementation for a particular concrete processor, which translates fragments of JVM object code into optimised object code for the concrete processor. The JIT is part of the Java Runtime Environment (JRE), not of the Java compiler, because this detail tuning of the object code happens at runtime for specific patterns in the code being executed on the specific concrete processor. Because the code fragments emitted by the different functional-language compilers are systematically different from those emitted by the Java compiler, there may be merit in this special tuning.

But, since the Java Runtime Environment comprises not just the JVM but also other components including, critically, the memory management subsystem, it occurs to me that given this very different memory usage pattern, a custom memory manager - implementing specifically a separate cons-space allocated as pages in the heap, using reference counts and a free-list - might well be an even bigger win. Furthermore, unlike the JIT tuning suggested by Wing and Singer, the memory manager tuning would be portable between different concrete processor families.

Wing and Singer's proposed change to the JIT would not prevent the JRE running any arbitrary JVM program, nor should any arbitrary JVM program run less efficiently than on a 'vanilla flavour' JRE. Neither (I hope) would my proposed change to the memory manager. Of course, the adapted JRE would be somewhat larger; you could describe this as code bloat. Of course, if all you want to run on your JVM is Java code, this adapted JRE would be of no benefit.

All this applies, of course, not only to the Java Runtime Environment. It almost certainly applies equally to the Common Language Runtime used in the .Net environment, to the runtime environment Erlang virtual machine (targeted by for example Joxa), and probably others.

However, it would not be trivial to retrofit this. The Clojure cons cell is a POJO ('plain old Java object'), allocated and deallocated by standard Java mechanisms. Joxa on the Erlang VM is similar, and I should be surprised if ClojureCLR on the .Net Common Language Runtime is much different. As any arbitrary object may hold pointers to cons cells, attempting to do special memory management for cons cells would require considerable rewriting of the memory management subsystems, at the runtime environment level. I say again, I do not believe the memory manager in the JRE is by any means naive. It is very highly tuned code written by very able engineers and tested widely across the world over a number of years.

Even supposing - as Wing, White and Singer's paper does suggest - that the Java Runtime Environment is not optimally tuned to the code patterns and object patterns of functional languages, it doesn't necessarily follow that changing it would improve things. But, the OpenJDK is just that - open. It would be possible to clone it and experiment. It would be possible to produce a variant JRE, either with specific tuning for specific things as I've described, or designed to allow modular selection and replacement of JIT and memory manager components at run time, and, if benchmarking proved the variant implementation more efficient at executing functional language programs while no less efficient at executing Java language programs, it might be worth contributing the changes back.

No comments:

Creative Commons Licence
The fool on the hill by Simon Brooke is licensed under a Creative Commons Attribution-ShareAlike 3.0 Unported License