Friday 25 August 2017

Riddles in the dark

I'm continuing my series of essays on the future of news; this essay may look like a serious digression, but trust me, it's not. The future of news, in my opinion, is about cooperation; it's about allowing multiple voices to work together on the same story; it's about allowing users to critically evaluate different versions of the story to evaluate which is more trustworthy. This essay introduces some of the technical underpinnings that may make that possible.

In January 2015, I ran a workshop in Birnam on Land Reform. While planning the workshop, I realised I would need a mechanism for people who had attended the workshop to work cooperatively on the document which was the output of the event. So, obviously, I needed a Wiki. I could have put the wiki on a commercial Wiki site like Wikia, but if I wanted to control who could edit it I'd have to pay, and in any case it would be plastered with advertising.

So I decided to install a Wiki engine on my own server. I needed a Wiki engine which was small and easy to manage; which didn't have too many software dependencies; which was easy to back up; and on which I could control who could edit pages. So I did a quick survey of what was available.

Wiki engines store documents, and versions of those documents. As many people work on those documents, the version history can get complex. Most Wiki engines store their documents in relational databases. Relational databases are large, complex bits of software, and when they need to be upgraded it can be a pain. Furthermore, relational databases are not designed to store documents; they're at there best with fixed size data records. A Wiki backed by a relational database works, but it isn't a good fit.

Software is composed of documents too - and the documents of which software is composed are often revised many, many times. In 2012 there were 37,000 documents, containing 15,004,006 lines of code (it's now over twenty million lines of code). It's worked on by over 1,000 developers, many of whom are volunteers. Managing all that is a considerable task; it needs a revision control system.

In the early days, when Linus Torvalds, the original author of Linux, was working with a small group of other volunteers on the early versions of the kernel, revision control was managed in the revision control system we all used back in the 1980s and 90s: the Concurrent Versions System, or CVS. CVS has a lot of faults but it was a good system for small teams to use. However, Linux soon outgrew CVS, and so the kernel developers switched to a proprietary revision control system, Bitkeeper, which the copyright holders of that system allowed them to use for free.

However, in 2005, the copyright holders withdrew their permission to use Bitkeeper, claiming that some of the kernel developers had tried to reverse engineer it. This caused a crisis, because there was then no other revision control system then available sophisticated enough to manage distributed development on the scale of the Linux kernel.

What makes Linus Torvalds one of the great software engineers of all time is what he did then. He sat down on 3rd April 2005 to write a new revision control system from scratch, to manage the largest software development project in the world. By the 29th, he'd finished, and Git was working.

Git is extremely reliable, extremely secure, extremely sophisticated, extremely efficient - and at the same time, it's small. And what it does is manage revisions of documents. A Wiki must manage revisions of documents too - and display them. A Wiki backed with git as a document store looked a very good fit, and I found one: a thing called Gollum.

Gollum did everything I wanted, except that it did not authenticate users so that I couldn't control who could edit documents. Gollum is written in a language called Ruby, which I don't like. So I thought about whether it was worth trying to write an authentication module in Ruby, or simply start again from scratch in a language I do like - Clojure.

I started working on my new wiki, Smeagol, on 10th November 2014, and on the following day it was working. On the 14th, it was live on the Internet, acting as the website for the Birnam workshop.

It was, at that stage, crude. Authentication worked, but the security wasn't very good. The edit screen was very basic. But it was good enough. I've done a bit more work on it in the intervening time; it's now very much more secure, it shows changes between different versions of a document better (but still not as well as I'd like), and it has one or two other technically interesting features. It's almost at a stage of being 'finished', and is, in fact, already used by quite a lot of other people around the world.

But Smeagol exploits only a small part of the power of Git. Smeagol tracks versions of documents, and allows them to be backed up easily; it allows multiple people to edit them. But it doesn't allow multiple concurrent versions of the same document; in Git terms, it maintains only one branch. And Smeagol currently allows you to compare a version of the document only with the most recent version; it doesn't allow you to compare two arbitrary versions. Furthermore, the format of the comparison, while, I think, adequately clear, is not very pretty. Finally, because it maintains only one branch, Smeagol has no mechanism to merge branches.

But: does Smeagol - that simple Wiki engine I whipped up in a few days to solve the problem of collaboratively editing a single document - hold some lessons for how we build a collaborative news system?

Yes, I think it does.

The National publishes of the order of 120 stories per issue, and of the order of 300 issues per year. That's 36,000 documents per year, a volume on the scale of the Linux kernel. There are a few tens of blogs, each with with at most a few tens of writers, contributing to Scottish political blogging; and at most a few hundred journalists.

So a collaborative news system for Scottish news is within the scale of technology we already know how to handle. How much further it could scale, I don't know. And how it would cope with federation I don't yet know - that might greatly increase the scaling problems. But I will deal with federation in another essay later.

No comments:

Creative Commons Licence
The fool on the hill by Simon Brooke is licensed under a Creative Commons Attribution-ShareAlike 3.0 Unported License