Saturday 7 December 2013

More grief creating formatted documents

I write my fiction using a 'word processor' which is in fact no more than a hacked-together set of shell scripts. To produce final proof output, I need a tool to render my text into nicely formatted PDF or PostScript. I do this by way of HTML, but I still need a tool for the HTML to PDF step. For years I've used Prince, which is very good indeed. It has three problems from my point of view:
  • It's proprietary software, and although you can legitimately use it for free, if you do it prints its own logo on the cover page of your document;
  • It's too expensive (US$ 495) for me to be able to really justify a license;
  • And finally - for me this one's the killer - it doesn't run on Debian, and because it isn't free, you can't just compile it yourself.
It does run on Ubuntu, and consequently I do run Ubuntu on one of my machines just so that I can run Prince, but I now want to run my fiction through my continuous integration toolchain, which runs under Jenkins on my server; and my server runs Debian.

Consequently over the years I have periodically evaluated other options - genuinely free software options - for doing my final formatting step. So far, I've found nothing good enough. Today, I've tried again.

There are two new options, pandoc and wkhtmltopdf.


Pandoc is an ambitious project to create a Swiss army knife for converting between text document formats - to do for text documents what ImageMagick does for raster graphics. To generate PDF it depends on LaTeX, and can use a variety of different LaTeX engines to do so. Unfortunately, the LaTeX stage fails, and I'm not sufficiently up on debugging LaTeX to work out why, although the error message suggests it's to do with not finding the right fonts:

simon@engraver:~/Documents/fiction/slave$ pandoc -o merchant.pdf merchant.html 
pandoc: Error producing PDF from TeX source.
! Font T1/cmr/m/n/10=ecrm1000 at 10.0pt not loadable: Metric (TFM) file not found.
l.100 \fontencoding\encodingdefault\selectfont

So from my point of view, that's a fail.


Wkhtmltopdf uses the WebKit rendering engine - the one behind Safari, descended from Konqueror's KHTML, and the one from which the Blink engine now used by Chrome and Opera was forked - to render the page. You'd think that would be bound to be a good one. Unfortunately, it isn't. It doesn't honour - presumably because browsers don't need to - all the print-oriented vocabulary of CSS.

My stylesheets specify different page margins for left and right hand pages, to put the wider margin in the gutter; they specify that left hand pages should show the book title in small caps on the left of the header line, and the page number on the left of the footer line; while right hand pages should show the current chapter title in small caps on the right of the header line and the page number on the right of the footer line. They specify that an image on the cover should bleed to the edge of the page. They specify that references in the table of contents should be resolved to the page number on which the content appears. They even specify that pages in the frontmatter should be numbered in roman numerals, while pages in the body should be numbered in arabic numerals.
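
As a rough illustration only, rules of this kind look something like the sketch below. It is written against the CSS Paged Media and Generated Content for Paged Media vocabulary as Prince documents it, and it is not the real stylesheet: the class names (book-title, chapter-title, frontmatter, body, cover, toc), the page size and the margin widths are all assumptions made up for the example.

```css
/* Sketch only: every selector, size and margin here is hypothetical. */

@page {
  size: A5;
  margin-top: 20mm;
  margin-bottom: 20mm;
}

/* Left hand pages: the gutter, and so the wider margin, is on the right. */
@page :left {
  margin-left: 15mm;
  margin-right: 25mm;
  @top-left { content: string(booktitle); font-variant: small-caps; }
  @bottom-left { content: counter(page); }
}

/* Right hand pages: the gutter is on the left. */
@page :right {
  margin-left: 25mm;
  margin-right: 15mm;
  @top-right { content: string(chaptertitle); font-variant: small-caps; }
  @bottom-right { content: counter(page); }
}

/* Capture the running header text from the document itself. */
h1.book-title { string-set: booktitle content(); }
h2.chapter-title { string-set: chaptertitle content(); }

/* Frontmatter pages are numbered in roman numerals. */
.frontmatter { page: frontmatter; }
@page frontmatter:left { @bottom-left { content: counter(page, lower-roman); } }
@page frontmatter:right { @bottom-right { content: counter(page, lower-roman); } }

/* The body restarts numbering at 1, in the default arabic numerals. */
.body { page: body; counter-reset: page 1; }

/* The cover has no margins, so its image can run to the page edge,
   and no header or page number. */
.cover { page: cover; }
@page cover {
  margin: 0;
  @top-right { content: none; }
  @bottom-right { content: none; }
}

/* Table of contents: resolve each link to the page its target lands on. */
ul.toc a::after { content: " " target-counter(attr(href), page); }
```

Everything here except the small caps is print-only vocabulary: the @page rules, the margin boxes such as @top-left, string-set and target-counter have no meaning on a scrolling screen.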

Wkhtmltopdf honours none of this. It doesn't show page headers at all, or page numbers. It can't resolve table of contents references. It won't bleed the cover page differently from the rest of the content. To be fair, it has to be said that the Amazon Kindle, which really ought to, does not honour this vocabulary either; but seventeen years after the publication of the CSS1 specification and ten years after Prince XML was launched, it's a bit disappointing.

Worse, wkhtmltopdf occasionally splits a single line of text over two pages, so that the top half of the letters appear on the bottom of one page while the bottom half are at the top of the following page.

So that too is a fail; not quite such an epic fail as pandoc, but not good.

So it looks as though I'm stuck with Prince; one of these days I may even have to buy it.


It transpires that although Prince is not supplied packaged for Debian, there is a 'generic linux' version (a gzipped tar containing, inter alia, an install script which installs neatly and correctly into /usr/local/) which works perfectly on Debian. So I'm happy again.

Creative Commons Licence
The fool on the hill by Simon Brooke is licensed under a Creative Commons Attribution-ShareAlike 3.0 Unported License