Repurposed with permission from Data Conversion Laboratory, DCLNews

If everyone formatted documents “correctly”, following all the rules, document conversion would be a “piece of cake”, as Michael Gross likes to say. But that’s not the real world. People don’t “always” read manuals, they might take some shortcuts, and sometimes they use software in ways never conceived by their developers. In this interview Michael Gross tells us about the hidden traps that can ambush a conversion effort.

Q: I recently saw a website advertising fully automated legacy document conversion to XML. Can this be true?

A: While it’s possible for tools or web-based software to perform a completely automated conversion, that doesn’t necessarily mean that the documents produced are “ready-to-use” XML. Assuming that the documents you want to convert are authored in a word-processing or a desktop publishing environment, the bulk of the document could certainly be converted to an XML representation. The primary challenge, however, is that electronic publishing is mostly about appearance (how the document will look) while XML is more about structure and content, so in many cases, the structure must be inferred from visual clues in the source documents (e.g. if something “looks” like a heading then it’s probably a heading, but not always).

Typically 95% of the document will convert properly, but the remaining 5% … take 95% of the effort to clean up.

The adage that 95% of the effort is expended on the last 5% of a project applies to document conversion as well. Typically 95% of the document will convert properly, but the remaining 5% that automated conversion can’t do right will take 95% of the effort to clean up. In reality, the accuracy of a conversion is partially a function of how well the documents were authored to begin with. Today’s electronic authoring environments have powerful features that allow users to set up sophisticated stylesheets and automatic link and cross-reference generation, so that if these are used properly, the output can come pretty close to a perfect conversion. But that is only if (and that’s a pretty big IF) the document authoring rules are closely followed and enforced for all document authors.

In the real world, product technical documentation is often completed in a rush, immediately before a product ships, and no one is taking the time at that point in the process to make sure that a document is being authored in the optimal way to make conversion easier. To make things worse, the people making changes often don’t really know the authoring environment very well, so the authoring is done sloppily…but it looks okay on paper, and that’s what counts at that moment. An XML industry pundit once told me that instead of today’s electronic publishing being referred to as WYSIWYG (What You See Is What You Get) authoring, it more appropriately should be called WYSIATYG – What You See Is All That You Got, because very often it looks good on paper, but is hard to make modifications to, and even harder to convert to a structured markup.

Q: What are the document elements that make totally automated conversion to XML difficult?

A: In conversion to XML, some tags relate to document structure, and others relate to document content.

Computers are just not that good at understanding “meaning.”

Content Tags. Document content tagging is usually the most challenging, since content tags usually refer to the meaning, or semantics, of what they contain. Computers are just not that good at understanding “meaning.” So, for instance, if the XML tagging requires a tag placed around a repair procedure, you need to infer this information as best you can, since the source publishing documents do not usually contain that type of information. This is often done by looking for specific word patterns (words such as “Repair” or “Fix”). This type of approach will usually require human review, since the automated process is bound to get it wrong at least some of the time (even humans don’t always agree on these). So you can expect to review many of your content tags.
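A word-pattern pass of the kind described might be sketched as follows. The keyword list and function names here are illustrative assumptions, not DCL’s actual rules, and the point stands: matches are only candidates for a human reviewer.

```python
import re

# Hypothetical keyword patterns suggesting a block is a repair procedure.
# Real projects tune such lists per document set; matches still need review.
REPAIR_PATTERN = re.compile(r"\b(repair|fix|replace|reassemble)\b", re.IGNORECASE)

def infer_procedure_type(paragraphs):
    """Flag paragraphs that look like repair procedures for tagging review."""
    candidates = []
    for i, text in enumerate(paragraphs):
        if REPAIR_PATTERN.search(text):
            candidates.append((i, text))
    return candidates

sample = [
    "Remove the four screws and lift the cover.",
    "To repair the feed assembly, first unplug the unit.",
    "Specifications: 120 V, 60 Hz.",
]
flagged = infer_procedure_type(sample)
# Only the second paragraph matches -- and a reviewer still confirms it.
```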

Structure Tags. Regarding automated conversion of structure tagging, here are some examples of document structures that might trip up the automated conversion:

  • Footnotes: They are often created using superscripts rather than a footnote tool, leaving you to guess whether a superscripted 2 means the mathematical notation for ‘squared’ or a reference to footnote number 2.
  • Cross References: If authored properly in modern electronic publishing systems, using the appropriate tools, these are readily convertible. But here again, very often they are simply authored as straight text, so that if a reference says, “See Figures 3, 5, 7, and the pie-chart at bottom of 9,” converting all of those to proper figure references is much harder than telling a computer to change every “Figure x” to a cross-reference. And then you need to deal with page-based references (such as “see next page”) that don’t belong in XML.
  • Tables: Tables often take the most effort to convert. To begin with, many times, source documents contain table-looking objects that were not authored in a table editor. They may have been done using frames and absolute positioning on a page to simulate the look of a table, or more often, using a combination of tabs and spaces. All of these make coming up with the proper XML table representation fraught with challenges. But even for tables authored within a table editor, doing a proper automated conversion can be challenging. Authors sometimes use hard returns within cells across a row to simulate the appearance of rows within a table. Line wrapping will often not occur the same way with XML, so it is particularly important that these hard returns within the source rows be converted to proper rows within the XML tables. Other formatting oddities can cause unexpected problems as well, for example, tables that have been cut and pasted to fit a page, or skewed in some way to match the layout. These types of situations often require manual intervention to get them converted properly.
  • Text and Special Characters: Converting special characters (like the Greek Alpha or a degree symbol) is getting easier because of the adoption of Unicode in many cases. But unfortunately, in source documents, special characters are often produced using a particular character from a rare font. Automated conversion requires that a Unicode equivalent of every character in every font be found, which can be a daunting task. Some characters have no Unicode equivalent and need to be converted as image references within the XML. Even something as “simple” as converting headings to title casing (i.e. the first letter of each word capitalized, with some exceptions) can’t be done properly without some review first (so that “IBM PRINTERS” does not become “Ibm Printers”).
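The cross-reference case above can be sketched as a simple pattern substitution. The `<xref>` element and the `fig\1` target IDs are assumed naming conventions for illustration; note how compound references deliberately escape the pattern and remain manual work.

```python
import re

# Matches only simple "Figure N" mentions.
FIGREF = re.compile(r"\bFigure\s+(\d+)\b")

def tag_figure_refs(text):
    """Wrap plain-text 'Figure N' references in a hypothetical <xref> element.
    Compound references ("Figures 3, 5, and 7") and page-based references
    ("see next page") fall through -- those still need a human."""
    return FIGREF.sub(r'<xref target="fig\1">Figure \1</xref>', text)

print(tag_figure_refs("Details appear in Figure 4."))
# -> Details appear in <xref target="fig4">Figure 4</xref>.
print(tag_figure_refs("See Figures 3, 5, and 7."))  # unchanged
```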
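The title-casing problem can likewise be sketched in a few lines. The acronym and minor-word lists here are assumed, hand-maintained exceptions; building and checking such lists is exactly the “review first” step the answer describes.

```python
# A minimal title-casing sketch; ACRONYMS and MINOR_WORDS are illustrative
# exception lists, not a complete style-guide implementation.
ACRONYMS = {"IBM", "XML", "HTML", "PDF"}
MINOR_WORDS = {"a", "an", "and", "of", "or", "the", "to"}

def title_case(heading):
    words = heading.split()
    out = []
    for i, w in enumerate(words):
        if w.upper() in ACRONYMS:
            out.append(w.upper())          # preserve known acronyms
        elif i > 0 and w.lower() in MINOR_WORDS:
            out.append(w.lower())          # lowercase minor words mid-title
        else:
            out.append(w.capitalize())
    return " ".join(out)

print(title_case("IBM PRINTERS"))                 # -> IBM Printers
print(title_case("CARE AND FEEDING OF PRINTERS")) # -> Care and Feeding of Printers
```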

Q: Can I assume that if I already have structured documents in an XML or SGML format, conversion to another XML structure can be completely automated?

A: While this is a reasonable expectation, the reality is that because not all XML documents are created equal, manual intervention, even in XML-to-XML conversion, is often required. Remember that the degree of content tagging within XML markup schemes varies greatly. If the target markup scheme requires a higher level of content markup than the source documents, then you will likely require some manual intervention. For instance, if you are converting to DITA (an increasingly popular XML markup scheme used for technical documentation), you need to break documents down into individual reusable topics, and define a type for each topic. This information doesn’t exist in most XML source documents, so some form of manual intervention is required.

Q: I’ve already converted my documentation to HTML webpages. Now I’d like to elevate these pages to an XML version. Can this be done with an automated conversion?

A: Because HTML is principally based on the SGML and XML markup standards, people assume that conversion to XML should be easy. In practice, however, HTML is often one of the most difficult formats to convert. First, the source HTML is often not well-formed, because web browsers tend to be quite forgiving and don’t enforce much structure. Secondly, in order to achieve a certain appearance on the page, HTML markup often uses convoluted tagging designed simply to produce a particular browser rendering. So, for example, HTML pages typically contain additional table structures that are not really meant to be tables (in an XML sense), but were used to position elements on a webpage. So now, to convert to XML, you’ve got to differentiate between true data tables (which should remain in the XML) and positional tables (which should be stripped out). In addition, HTML pages tend to be cluttered with navigational aids, JavaScript code, and advertisements, all of which need to be removed to produce correct XML. As with other legacy document conversion, if the HTML pages were authored in a highly consistent fashion, there is a greater chance that an automated conversion will produce accurate XML.
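One common heuristic for separating data tables from layout tables can be sketched with Python’s standard `html.parser`. The “has a `<th>` cell” rule below is an assumption chosen for illustration; a real conversion would weigh several signals (nesting depth, cell counts, width attributes) before deciding.

```python
from html.parser import HTMLParser

class TableScanner(HTMLParser):
    """Heuristic scan: a <table> containing <th> cells is treated as a real
    data table; one without is assumed to be layout-only."""
    def __init__(self):
        super().__init__()
        self.depth = 0
        self.has_th = []      # one flag per open table
        self.results = []     # True = likely data table

    def handle_starttag(self, tag, attrs):
        if tag == "table":
            self.depth += 1
            self.has_th.append(False)
        elif tag == "th" and self.depth:
            self.has_th[-1] = True

    def handle_endtag(self, tag):
        if tag == "table" and self.depth:
            self.depth -= 1
            self.results.append(self.has_th.pop())

scanner = TableScanner()
scanner.feed('<table><tr><th>Part</th></tr><tr><td>Belt</td></tr></table>'
             '<table><tr><td><img src="logo.gif"></td></tr></table>')
print(scanner.results)  # -> [True, False]
```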

If you know your target format’s requirements in advance . . . you’ll have a better chance at converting them in an automated fashion

So whether you are converting from electronic publishing formats or from markup formats, if you know your target format’s requirements in advance and can plan the way you author your documentation, developing rigorous standards that are followed precisely, you’ll have a better chance at converting your documents in an automated fashion. If this is not the case (and it rarely is), for the reasons we have outlined, you should expect to put in a fair amount of manual effort to check the results of your automated conversion and to address the issues that the conversion could not handle properly.

About Michael Gross

Michael Gross is the Chief Technology Officer and Director of Research and Development for Data Conversion Laboratory. He is responsible for all software-related issues, including product evaluations, feasibility studies, technical client support, and management of in-house software development. He has been solving digital publishing conversion problems at DCL for twenty years and has overseen thousands of legacy conversion projects.