APHID-DocBook
 

Inside APHID

Note
Unless you are interested in the inner workings of APHID, feel free to skip this page.

How it all Works

APHID uses a number of different tools to accomplish its task. To understand how all the pieces fit together, we need to talk about the various technologies involved.

The Big Picture

Attribution
Much of the text in this document is taken from the online version of Eric Steven Raymond's The Art of Unix Programming distributed under the terms of the Creative Commons Attribution-NoDerivs 1.0 license. A reference copy of this license may be found at http://creativecommons.org/licenses/by-nd/1.0/legalcode.

DocBook is a structural-level markup language. Specifically, it is a dialect of XML. A DocBook document is a piece of XML that uses XML tags for structural markup.
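To make "structural markup" concrete, here is a minimal sketch in Python that parses a tiny hand-written fragment and prints its element outline. The fragment is illustrative only and is not taken from APHID; the element names `article`, `section`, `title`, and `para` are ordinary DocBook elements.

```python
import xml.etree.ElementTree as ET

# A tiny hand-written DocBook-style fragment (illustrative only).
DOC = """<article>
  <title>Inside APHID</title>
  <section>
    <title>How it all Works</title>
    <para>APHID uses a number of different tools.</para>
  </section>
</article>
"""

def outline(element, depth=0):
    """Return (depth, tag) pairs describing the element tree."""
    rows = [(depth, element.tag)]
    for child in element:
        rows.extend(outline(child, depth + 1))
    return rows

root = ET.fromstring(DOC)
for depth, tag in outline(root):
    # Indent each tag by its nesting depth to show the structure.
    print("  " * depth + tag)
```

Note that the tags say what each piece of text is (a title, a paragraph), not how it should look; appearance is left to the stylesheet.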

For a document formatter to apply a stylesheet to your XML document and make it look good, it needs to know things about the overall structure of your document. For example, in order to physically format chapter headers properly, it needs to know that a book manuscript normally consists of front matter, a sequence of chapters, and back matter. In order for it to know this sort of thing, you need to give it a Document Type Definition or DTD. The DTD tells your formatter what sorts of elements can be in the document structure, and in what order they can appear.

What we mean by calling DocBook a dialect of XML is actually that DocBook is a DTD — a rather large DTD, with somewhere around 400 tags in it.

Lurking behind DocBook is a program called a validating parser. When you format a DocBook document, the first step is to pass it through a validating parser (the front end of the DocBook formatter). This program checks your document against the DocBook DTD to make sure you are not breaking any of the DTD's structural rules (otherwise the back end of the formatter, the part that applies your stylesheet, might become quite confused).

The validating parser will either throw an error, giving you messages about places where the document structure is broken, or translate the document into a stream of XML elements and text that the parser back end combines with the information in your stylesheet to produce formatted output.
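The idea of checking a document against structural rules can be sketched with a toy checker. This is not how a real validating parser works, and the `RULES` table below is a hypothetical, heavily simplified stand-in for the DocBook DTD: it only records which child elements are allowed and in what order.

```python
import xml.etree.ElementTree as ET

# Hypothetical, heavily simplified content rules: for each element, the
# child tags allowed, in the order they must appear. The real DocBook DTD
# is far larger and more expressive than this.
RULES = {
    "article": ["title", "section"],
    "section": ["title", "para"],
    "title": [],
    "para": [],
}

def validate(element, path=""):
    """Return a list of error messages; an empty list means the toy rules hold."""
    here = f"{path}/{element.tag}"
    if element.tag not in RULES:
        return [f"{here}: unknown element"]
    allowed = RULES[element.tag]
    errors = []
    position = 0  # index into `allowed`; children may not move backwards
    for child in element:
        if child.tag not in allowed:
            errors.append(f"{here}: <{child.tag}> not allowed here")
            continue
        index = allowed.index(child.tag)
        if index < position:
            errors.append(f"{here}: <{child.tag}> is out of order")
        position = max(position, index)
        errors.extend(validate(child, here))
    return errors

good = ET.fromstring(
    "<article><title>T</title>"
    "<section><title>S</title><para>p</para></section></article>"
)
bad = ET.fromstring(
    "<article><section><para>p</para><title>S</title></section></article>"
)

print(validate(good))
print(validate(bad))
```

The well-formed document yields no messages, while the second one is reported because its section title follows a paragraph. A real validating parser does the same kind of check, but against the full DTD.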

The following figure diagrams the basic XML document process:

The DocBook Process

The part of the diagram inside the dotted box is your formatting software, or toolchain. Besides the obvious and visible input to the formatter (the document source) you need to keep the two hidden inputs of the formatter (the DTD and stylesheet) in mind to understand what follows.

To turn your documents into HTML or PDF, you need an engine that can apply the combination of DocBook DTD and a suitable stylesheet to your document.

The following figure illustrates how the open-source tools for doing this fit together.

The DocBook Toolchain

Parsing your document and applying the stylesheet transformation within APHID is handled by Saxon. There are other transformation programs available, most notably xsltproc and Xalan.
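As a rough illustration of the transformation step, the sketch below builds a command line for xsltproc, one of the alternative processors named above, rather than for Saxon, whose options differ between versions. The three file names are placeholders and are not files shipped with APHID.

```python
import shlex

def xsltproc_command(stylesheet, source, output):
    """Build the argument list for: xsltproc -o OUTPUT STYLESHEET SOURCE."""
    return ["xsltproc", "-o", output, stylesheet, source]

# All three file names are placeholders, not files shipped with APHID.
argv = xsltproc_command("docbook-html.xsl", "manual.xml", "manual.html")

# Print the command instead of running it, so the sketch works even
# where xsltproc is not installed.
print(shlex.join(argv))
```

To actually run the transformation, the same list could be passed to `subprocess.run(argv, check=True)` on a machine where xsltproc and a DocBook stylesheet are available.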

It is relatively easy to generate high-quality XHTML from DocBook; the fact that XHTML is simply another XML DTD helps a lot. Translation to HTML is done by applying a rather simple stylesheet, and that's the end of the story. RTF is also simple to generate in this way, and from XHTML or RTF it's easy to generate a flat ASCII text approximation.

The more difficult case is print. Generating high-quality printed output (which means, in practice, Adobe's PDF, the Portable Document Format) requires algorithmically duplicating the delicate judgments of a human typesetter when moving from content to the presentation level.

So, first, a stylesheet translates DocBook's structural markup into another dialect of XML: FO (Formatting Objects). FO markup is very much presentation-level, and it still has to be rendered into a print format. In the APHID toolchain, this job is handled by FOP, a direct FO-to-PostScript translator being developed by the Apache project.

One of the things that makes learning DocBook difficult is that the newbie tends to become overwhelmed by long lists of W3C standards, the many associated XML tags, and so on. Let's face it: sometimes you just want to start composing your document without having to worry about the technical details of formatting it correctly. That is where APHID comes in.

DocBook Without all the Pain

With APHID, you are free to compose your documents as simple ASCII text. This is possible because APHID extends the basic DocBook processing stream by prepending an ASCII text-to-DocBook/XML converter.

The following figure illustrates the APHID process stream:

The APHID Process

APHID parses your document and applies a stylesheet to transform a structured text document into DocBook/XML. Once the document is converted into XML, the rest of the standard DocBook processing tools are used to produce the various output formats. The program that does this within the APHID toolchain is APTconvert, where APT stands for Almost Plain Text.

The following figure illustrates the toolchain used to convert text documents into other output documents:

The APHID Toolchain

Notice that the source document is composed of structured text. This means that the source document is not entirely free-form. This is necessary because there has to be some way to convert things into 'titles', 'sections', 'paragraphs', etc.

The good news is that the formatting required is minimal. APHID uses as little markup as possible to express the structure of the document; instead, it relies on paragraph indentation.
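The idea of recovering structure from indentation can be sketched with a toy converter. This is not APTconvert and does not implement the real APT grammar. It assumes three invented rules: blank lines separate blocks, a block starting flush left is a section title, and a block starting with indentation is a paragraph.

```python
import re
import xml.etree.ElementTree as ET

def text_to_docbook(text):
    """Convert toy structured text into a DocBook-style <article> element.

    Assumed toy rules (not the real APT format): blank lines separate
    blocks; a flush-left block is a section title; an indented block is
    a paragraph belonging to the most recent section.
    """
    article = ET.Element("article")
    current = article  # paragraphs before any title attach to the article
    for block in re.split(r"\n\s*\n", text.strip("\n")):
        lines = [line for line in block.splitlines() if line.strip()]
        if not lines:
            continue
        content = " ".join(line.strip() for line in lines)
        if lines[0][0].isspace():
            ET.SubElement(current, "para").text = content
        else:
            current = ET.SubElement(article, "section")
            ET.SubElement(current, "title").text = content
    return article

SAMPLE = """The Big Picture

  DocBook is a structural-level markup language.
  It is a dialect of XML.

DocBook Without all the Pain

  With APHID, you compose documents as plain text.
"""

root = text_to_docbook(SAMPLE)
print(ET.tostring(root, encoding="unicode"))
```

The source text carries no tags at all, yet the output contains two sections, each with a title and a paragraph. The real converter handles many more constructs, but the principle is the same.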