13 November 2007

From plaintext to Pandoc, searching for a perfect markup

Some issues seem to come up again and again over a period of months and years. For me, one of the biggest annoyances of recent years has been finding a form of lightweight markup in which I can compose all of my documents. This kind of thing is a very personal choice. With that in mind, I will share the process by which I made the choice I am now happy with, and the simple, lazy workflow it gives me.

Why is this so important to me?

I spend most of my time at a computer in front of a text editor. I know the editor inside out, so it is a very efficient environment. Even if I am not on my own system, I can always find a text editor and just start writing, whatever environment I am on: Unix, Windows, even my old Psion. So a plaintext solution for all of my documents is a great timesaver.

Consider a few of the documents I deal with on a day-to-day basis, each of which needs a different application to work with:

  • Webpages need to be HTML
  • Emails need to be plaintext
  • Documentation and application forms need to be richtext
  • Manuals need to be PDF
  • Python documentation needs to be reStructuredText
  • Wiki/Trac entries need to be in their wiki format

Frequently I need several different formats at once: HTML, PDF and richtext for user documentation is a typical example. Writing one master document and generating several outputs from it is a great timesaver and makes document revisions easier to manage.

Why is it so difficult to find a solution?

I have two very different requirements: firstly, a markup syntax that I feel comfortable with; secondly, a set of unified tools that enable easy conversion to other formats.

The path to the final solution

At first I started writing everything in HTML since I used it so often, but as expected it was too verbose, difficult to read, and needed lots of tools to convert to and from different formats.

Next I discovered Markdown by John Gruber. It has a very simple plaintext format that is easily readable and converts to very clean HTML. However, it was lacking (intentionally) a few little things I used frequently, most notably definition lists and simple tables. Having said that, it was a pleasure to work with and made a lot of sense, so I used it a lot.

These shortcomings meant that over time I researched other Markdown varieties and came to MultiMarkdown by Fletcher Penney. This solved a lot of my problems and gave me a broader variety of output formats, but it meant that if I used other Markdown tools, my documents didn’t render quite as I would have liked.

Around this time I had begun using Python as my coding language of choice, so it seemed that reStructuredText was the obvious way to go: incredibly full-featured, with lots of supporting tools, and the documentation language for Python. I really tried with it, but for some reason the syntax just seemed a little ugly, and between the syntax and the various tools I had to use to produce different output formats, I didn’t seem to be saving much time.

So eventually I went back to using a flavor of Markdown and made a great discovery...

And the winner is… Pandoc

Pandoc is a well thought out and well executed piece of software that I now find indispensable. Written in Haskell, it is extremely fast, but more than that, it has a huge number of features that fulfill 99% of my requirements, such as:

  • Markdown compliant
  • generates document fragments or whole documents
  • reads HTML and simple reStructuredText
  • writes HTML, reStructuredText, RTF, PDF

My lazy document workflow

Firstly I created a folder containing a CSS stylesheet for HTML output, some custom headers and footers for documents, and a logo. Next I created a simple shell script which, when passed a filename, creates a folder (based on the name) and converts the file into every possible output format based on the input file type.

An input file foo.txt will generate the following files in a couple of seconds:

foo/
 foo.html
 foo.s5.html
 foo.rtf
 foo.xml
 foo.la.tex
 foo.con.tex
 foo.pdf

This is of course more than I usually need, but the process is so fast and easy that I just grab the ones I need and then delete the rest of the folder.
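
For readers who want to try something similar, here is a minimal sketch of such a conversion script, written in Python rather than as the shell script I actually use. It is an illustration under stated assumptions, not my real script: the names TARGETS, plan_commands and convert are made up for this example, pairing foo.xml with Pandoc's DocBook writer is a guess based on the file extension, and the --standalone, --to, --output and --css options are taken from current Pandoc releases, so older versions may differ. PDF output is left out because it needs a LaTeX installation and the way it is produced depends on the Pandoc version.

```python
import subprocess
import sys
from pathlib import Path

# Output suffix -> pandoc writer name. The suffixes follow the listing
# above; pairing ".xml" with the DocBook writer is an assumption.
TARGETS = [
    (".html", "html"),
    (".s5.html", "s5"),
    (".rtf", "rtf"),
    (".xml", "docbook"),
    (".la.tex", "latex"),
    (".con.tex", "context"),
]


def plan_commands(source, css=None):
    """Return (output_dir, commands) for one input file.

    Each command is an argument list for one pandoc invocation.
    Nothing is executed here, so the plan can be inspected first.
    """
    src = Path(source)
    out_dir = src.parent / src.stem  # foo.txt -> foo/
    commands = []
    for suffix, writer in TARGETS:
        out_file = out_dir / (src.stem + suffix)
        cmd = ["pandoc", "--standalone", "--to", writer,
               "--output", str(out_file)]
        # Link the stylesheet only in the plain HTML output.
        if css and writer == "html":
            cmd += ["--css", str(css)]
        cmd.append(str(src))
        commands.append(cmd)
    return out_dir, commands


def convert(source, css=None):
    """Create the output folder and run every planned pandoc command."""
    out_dir, commands = plan_commands(source, css)
    out_dir.mkdir(parents=True, exist_ok=True)
    for cmd in commands:
        subprocess.run(cmd, check=True)  # raises if pandoc reports an error


if __name__ == "__main__" and len(sys.argv) == 2:
    convert(sys.argv[1])
```

Saved under a hypothetical name such as convert_all.py, it would be run as "python convert_all.py foo.txt" with pandoc on the PATH. Keeping the planning step separate from the execution step makes it easy to print the commands and check them before anything is written to disk.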

Finally

Finally I have a toolset that does almost everything I need, and a whole lot more, all in one integrated bundle, and it lets me use my favorite markup syntax.

I recommend that you take a look at Pandoc if you have similar needs.
