Converting legacy documents into HTML

The World-Wide Web has flooded the world with an unprecedented amount of information. Back in prehistoric times, finding information used to take up a significant part of the day for many of us, whether we were looking for technical, scientific, legal or any other kind of information. I remember the frustration of needing days to find the specification of a file format or an electronic component (in books made of paper; yes, paper, young people, can you believe it???).

These dark ages are gone.

We are now in an era where obtaining the data we need every day is no longer a matter of storing voluminous technical books, subscribing to catalog update newsletters, or having access to an expensive library. We live in the glory days where we just need to get access to an indexer, type a few keywords and get everything from the organization of Unix disk blocks to French cooking (how sweet!). Pioneers from Ted Nelson to Tim Berners-Lee, from Marc Andreessen and Eric Bina to Louis Monier, and many other known and unknown people, deserve the credit for that. As does every anonymous ant that works daily to put more and more documents on the Web.

Is it that simple? Unfortunately, no.

The point is that humanity didn't start thinking the day Marc and Eric launched Mosaic and it didn't crash immediately. A large body of documents was written before the birth of the Web (or Gopher, or even FTP). And we still use them. Day after day, they are converted to HTML so we can put them on the Web. Producing a good, useful document from these legacy docs is not always easy, especially considering that they weren't designed to be put on the Web.

So far, I have converted four non-trivial documents for the Web: the Inter-Client Communication Conventions Manual, the GIF89a specification, the GWM Manual, and the Xlib Programming Manual.

I would like to share the little experience I gained doing this work, so that others can avoid repeating some of my errors.

Hypertext is more than text

A good hypertext document is more than text with a few tags added to make it look fancy. A good hypertext document should contain hyperlinks. A seminal paper on hypertext by Vannevar Bush was entitled "As We May Think". And this is the essence of hypertext: when, while reading, you wonder what this or that means, you should be able to just click on the puzzling word (or take whatever action is appropriate) and find out for yourself who this Vannevar Bush fellow was and what he wrote.

Always provide a way to read the document linearly

Although hypertext is in many ways superior to plain old text, a good hypertext document is designed as a hypertext document, while plain old text was written to be read linearly. That is how it is most understandable. So you should provide a way to read the document from the first word to the last in a linear manner. Do it out of respect for the original author's work, if for no other reason.
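
As a rough illustration, suppose the converted document has been split into one HTML page per section. One way to keep it readable from start to finish is to give every page a link to the previous and the next one. The sketch below is only that, a sketch: the function name, the page file names and the titles are all made up for the example.

```python
from html import escape


def navigation_links(pages):
    """Return, for each page, an HTML fragment linking to its neighbours.

    `pages` is an ordered list of (filename, title) pairs, in the order
    the original document was meant to be read.
    """
    fragments = {}
    for i, (filename, _title) in enumerate(pages):
        parts = []
        if i > 0:  # every page but the first gets a "previous" link
            prev_file, prev_title = pages[i - 1]
            parts.append(
                f'<a href="{escape(prev_file)}" rel="prev">'
                f"Previous: {escape(prev_title)}</a>"
            )
        if i < len(pages) - 1:  # every page but the last gets a "next" link
            next_file, next_title = pages[i + 1]
            parts.append(
                f'<a href="{escape(next_file)}" rel="next">'
                f"Next: {escape(next_title)}</a>"
            )
        fragments[filename] = " | ".join(parts)
    return fragments


# Hypothetical page list, for illustration only.
pages = [
    ("intro.html", "Introduction"),
    ("events.html", "Events"),
    ("colors.html", "Colors & Fonts"),
]
nav = navigation_links(pages)
print(nav["events.html"])
```

The fragment for each page can then be pasted at the top and bottom of that page, so that a reader who simply follows "Next" goes through the text in the author's original order.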

Forget about numerical (or alphabetical) references

In legacy documents, you often find references such as "see section 3, p. 24". You should forget about this kind of reference. They were invented in a world without hyperlinks, to allow for fast navigation through the information. These references should be hyperlinked, of course, but the text of the link shouldn't be something as meaningless as "sec. 3". As an example, consider the following two sentences, and guess which one is more helpful in finding the information you are looking for:

For further details, see sections 4.1.2.4 and 4.1.4.

For further details, see Hints and properties and Changing Window State.
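
Part of this rewriting can be automated. Here is a minimal sketch, assuming you have already built a table mapping each section number to its title and to an anchor in the converted document. The section numbers and titles come from the two sentences above; the anchor names, the table and the function name are invented for the example, and the input text is assumed to be HTML-safe already.

```python
import re
from html import escape

# Hypothetical table of contents: section number -> (anchor, title).
# The anchor names are made up for this sketch.
SECTIONS = {
    "4.1.2.4": ("#hints", "Hints and properties"),
    "4.1.4": ("#changing-window-state", "Changing Window State"),
}

# A dotted section number such as 4.1.4, optionally preceded by the word
# "section" or "sections", which is dropped when the number is replaced.
REFERENCE = re.compile(r"(?:\bsections?\s+)?(?<![\d.])(\d+(?:\.\d+)+)")


def link_section_references(text, sections=SECTIONS):
    """Replace known section numbers in `text` with links named by title."""

    def replace(match):
        entry = sections.get(match.group(1))
        if entry is None:
            return match.group(0)  # unknown number: leave the text untouched
        anchor, title = entry
        return f'<a href="{escape(anchor)}">{escape(title)}</a>'

    return REFERENCE.sub(replace, text)


legacy = "For further details, see sections 4.1.2.4 and 4.1.4."
converted = link_section_references(legacy)
print(converted)
```

This only catches dotted numbers; a bare "section 3", or a page number, would still have to be handled by hand or with a stricter pattern, and the result should be proofread in any case.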

Read and use your own translation

Just as good software is software written by the people who use it, I do think that good hypertext is the hyperlinked document that the person who translated it needed. Try to put yourself in the state of mind of a novice reader, and browse through your translation. Most of the time, you will think: "I should have a link here", or "what does this mean exactly?". This is the best way to get a rich hypertext document.

Christophe Tronche, ch@tronche.com