Tuesday 19 September 2017

Implementing post-scarcity hardware

The address space hinted at by using 64 bit cons-space and a 64 bit vector space containing objects each of whose length may be up to 1.4e20 bytes (2^64 of 64 bit words) is so large that a completely populated post-scarcity hardware machine can probably never be built. But that doesn't mean I'm wrong to specify such an address space: if we can make this architecture work for machines that can't (yet, anyway) be built, it will work for machines that can; and, changing the size of the pointers, which one might wish to do for storage economy, can be done with a few edits to consspaceobject.h.

But, for the moment, let's discuss a potential 32 bit PoSH machine, and how it might be built.

Pass one: a literal implementation

Let's say a processing node comprises a two core 32 bit processor, such as an ARM, 4GB of RAM, and a custom router chip. On each node, core zero is the actual processing node, and core one handles communications. We arrange these on a printed circuit board that is 4 nodes by 4 nodes. Each node is connected to the nodes in front, behind, left and right by tracks on the board, and by pins to the nodes on the boards above and below. On the edges of the board, the tracks which have no 'next neighbour' lead to some sort of reasonably high speed bidirectional serial connection - I'm imagining optical fibre (or possibly pairs of optical fibre, one for each direction). These boards are assembled in stacks of four, and the 'up' pins on the top board and the 'down' pins (or sockets) on the bottom board connect to similar high speed serial connectors.

This unit of 4 boards - 64 compute nodes - now forms both a logical and a physical cube. Let's call this cube module a crystal. Connect left to right, top to bottom and back to front, and you have a hypercube. But take another identical crystal, place it along side, connect the right of crystal A to the left of crystal B and the right of B to the left of A, leaving the tops and bottoms and fronts and backs of those crystals still connected to themselves, and you have a larger cuboid with more compute power and address space but slightly lower path efficiency. Continue in this manner until you have four layers of four crystals, and you have a compute unit of 4096 nodes. So the basic 4x4x4 building block - the 'crystal' - is a good place to start, and it is in some measure affordable to build - low numbers of thousands of pounds, even for a prototype.

I imagine you could get away with a two layer board - you might need more, I'm no expert in these things, but the data tracks between nodes can all go on one layer, and then you can have a raster bus on the other layer which carries power, backup data, and common signals (if needed).

So, each node has 4Gb of memory (or more, or less - 4Gb here is just illustrative). How is that memory organised? It could be treated as a heap, or it could be treated as four separate pages, but it must store four logical blocks of data: its own curated conspage, from which other nodes can request copies of objects; its own private housekeeping data (which can also be a conspage, but from which other nodes can't request copies); its cache of copies of data copied from other nodes; and its heap.

Note that a crystal of 64 nodes each with 4Gb or RAM has a total memory of 256Gb, which easily fits onto a single current generation hard disk or SSD module. So I'm envisaging that either the nodes take turns to back up their memory to backing store all the time during normal operation. They (obviously) don't need to backup their cache, since they don't curate it.

What does this cost? About £15 per processor chip, plus £30 for memory, plus the router, which is custom but probably still in tens of pounds, plus a share of the cost of the board; probably under £100 per node, or £6500 for the 'crystal'.

Pass two: a virtual implementation

OK, OK, this crystal cube is a pretty concept, but let's get real. Using one core of each of 64 chips makes the architecture very concrete, but it's not necessarily efficient, either computationally or financially.

64 core ARM chips already exist:

1. Qualcom Hydra - 64 of 64 bit cores;
2. Macom X-Gene - 64 of 64 bit cores;
2. Phytium Mars - 64 cores, but frustratingly this does not say whether cores are 32 or 64 bit

There are other interesting chips which aren't strictly 64 core:

1. Cavium ThunderX - ARM; 96 cores, each 64 bit, in pairs of two, shipping now;
2. Sparc M8 - 32 of 64 bit cores each capable of 8 concurrent threads; shipping now.

Implementing the virtual hypercube

Of course, these chips are not designed as hypercubes. We can't route our own network of physical connections into the chips, so our communications channels have to be virtual. But we can implement a communications channel as a pair of buffers, an 'upstream' buffer writable by the lower-numbered processor and readable by the higher, and a 'downstream' buffer writable by the higher numbered processor and readable by the lower. Each buffer should be at least big enough to write a whole cons page object into, optionally including a cryptographic signature if that is implemented. Each pair of buffers also needs at least four bits of flags, in order to be able, for each direction, to be able to signal

0. Idle - the processor at the receiving end is idle and can accept work;
1. Busy writing - the processor at the sending end is writing data to the buffer, which is not yet complete;
2. Ready to read - the processor at the sending end has written data to the buffer, and it is complete;
3. Read - the processor at the receiving end has read the current contents of the buffer.

Thus I think it takes at least six clock ticks to write the buffer (set busy-writing, copy four 64 bit words into the buffer, set ready-to-read) and five to read it out - again, more if the messages are cryptographically signed - for an eleven clock tick transfer (the buffers may be allocated in main memory, but in practice they will always live in L2 cache). That's probably cheaper than making a stack frame. All communications channels within the 'crystal' cost exactly the same.

But note! As in the virtual design, a single thread cannot at the same time execute user program and listen to communications from neighbours. So a node has to be able to run two threads. Whether that's two threads on a single core, or two cores per node, is a detail. But it makes the ThunderX and Spark M8 designs look particularly interesting.

But note that there's one huge advantage that this single-chip virtual crystal has over the literal design: all cores access the same memory pool. Consequently, vector space objects never have to be passed hop, hop, hop across the communications network, all can be accessed directly; and to pass a list, all you have to pass is its first cons cell. So any S-Expression can be passed from any node to any of its 6 proximal neighbours in one hop.

There are downsides to this, too. While communication inside the crystal is easier and quicker, communication between crystals becomes a lot more complex and I don't yet even have an idea how it might work. Also, contention on the main address bus, with 64 processors all trying to write to and read from the same memory at the same time, is likely to be horrendous, leading to much lower speed than the solution where each node has its own memory.

On a cost side, you probably fit this all onto one printed circuit board as against the 4 of the 'literal' design; the single processor chip is likely to cost around £400; and the memory will probably be a little cheaper than on the literal design; and you don't need the custom routers, or the connection hardware, or the optical transceivers. So the cost probably looks more like £5,000. Note also that this virtual crystal has 64 bit processors (although address bus contention will almost certainly burn all that advantage and more).

Conclusion

An experimental post-scarcity machine can be built now - and I can almost afford to build it. I don't have the skills, of course; but I can learn.


Thursday 14 September 2017

Hardware of the deep future

HAL9000, a visualisation of hardware of the deep future
In thinking about how to write a software architecture that won't quickly become obsolescent, I find that I'm thinking increasingly about the hardware on which it will run.

In Post Scarcity Hardware I envisaged a single privileged node which managed main memory. Since then I've come to thing that this is a brittle design which will lead to bottle necks, and that each cons page will be managed by a separate node. So there needs to be a hardware architecture which provides the shortest possible paths between nodes.

Well, actually... from a software point of view it doesn't matter. From a software point of view, provided it's possible for any node to request a memory item from any other node, that's enough, and, for the software to run (slowly), a linear serial bus would do. But part of the point of this thinking is to design hardware which is orders of magnitude faster than the von Neumann architecture allows. So for performance, cutting the number of hops to a minimum is important.

I've been reading Danny Hillis' thesis and his book The Connection Machine which, it transpires, is closely based on it. Danny Hillis was essentially trying to do what I am trying to do, but forty years ago, with the hardware limitations of forty years ago (but he was trying to do it in the right place, and with a useful amount of money that actually allowed him to build something physical, which I'm never likely to have).

Hillis' solution to the topology problem, as I understand it (and note - I may not understand it very well) is as follows:

If you take a square grid and place a processor at every intersection, it has at most four proximal neighbours, and, for a grid which is x cells in each direction, the longest path between two cells is 2x. If you join the nodes on the left hand edge of the grid to the corresponding nodes on the right hand edge, you have a cylinder, and the longest path between two nodes is 1.5x. If you then join the nodes on the top of the grid to the nodes on the bottom, you have a torus - a figure like a doughnut or a bagel. Every single node has four proximal neighbours, and the longest path between any two nodes is x.

So far so good. Now, let's take square grids and stack them. This gives each node at most six proximal neighbours. We form a cube, and the longest distance between two nodes is 3x. We can link the nodes on the left of the cube to the corresponding nodes on the right and form a (thick walled) cylinder, and the longest distance between two nodes is 2.5x. Now join the nodes at the top of the cube to the corresponding nodes at the bottom, and we have a thick walled torus. The maximum distance between is now 2x.

Let's stop for a moment and think about the difference between logical and physical topology. Suppose we have a printed circuit board with 100 processors on it in a regular grid. We probably could physically bend the circuit board to form a cylinder, but there's no need to do so. We achieve exactly the same connection architecture simply by using wires to connect the left side to the right. And if we use wires to connect those at the top with those at the bottom, we've formed a logical torus even though the board is still flat.

It doesn't even need to be a square board. We could have each processor on a separate board in a rack, with each board having four connectors probably all along the same edge, and use patch wires to connect the boards together into a logical torus.

So when we're converting our cube into a torus, the 'cube' could consist of a vertical stack of square boards each of which has a grid of processors on it. But it could also consist of a stack of boards in a rack, each of which has six connections, patched together to form the logical thick-walled torus. So now lets take additional patch leads and join the nodes that had been on the front of the logical cube to the corresponding nodes on the back of the logical cube, and we have a topology which has some of the properties of a torus and some of the properties of a sphere, and is just mind-bending if you try to visualise it.

This shape is what I believe Hillis means by a hypercube, although I have to say I've never found any of the visualisations of a hypercube in books or on the net at all helpful, and they certainly don't resemble the torusy-spherey thing I which visualise.

It has the very useful property, however, that the longest distance between any two nodes is 1.5x.

Why is 1.5x on the hypercube better than 1x on the torus? Suppose you want to build a machine with about 1000 nodes. The square root of a thousand is just less than 32, so let's throw in an extra 24 nodes to make it a round 32. We can lay out 1024 nodes on a 32 x 32 square, join left to right, top to bottom, and we have a maximum path between two of 1024 nodes of 32 hops. Suppose instead we arrange our processors on ten boards each ten by ten, with vertical wires connecting each processor with the one above it and the one below it, as well tracks on the board linking each with those east, west, north and south. Connect the left hand side to the right, the front to the back and the top to the bottom, and we have a maximum path between any two of 1000 nodes of fifteen hops. That's twice as good.

Obviously, if you increase the number of interconnectors to each processor above six, the paths shorten further but the logical topology becomes even harder to visualise. This doesn't matter - it doesn't actually have to be visualised - but wiring would become a nightmare.

I've been thinking today about topologies which would allow higher numbers of connections and thus shorter paths, and I've come to this tentative conclusion.

I can imagine topologies which tesselate triangle-tetrahedron-hypertetrahedron and pentagon-dodecahedron-hyperdodecahedron. There are possibly others. But the square-cube-hypercube model has one important property that those others don't (or, at least, it isn't obvious to me that they do). In the square-cube-hypercube model, every node can be addressed by a fixed number of coordinates, and the shortest path from any node to any other is absolutely trivial to compute.

From this I conclude that the engineers who went before me - and who were a lot more thoughtful and expert than I am - were probably right: the square-cube-hypercube model, specifically toruses and hypercubes, is the right way to go.