
Migration to XML_Docbook



Below is a very preliminary analysis I was asked to perform for someone 
interested in converting unstructured FM docs to structured FM+SGML docs 
that conform to the Docbook DTD. I am posting it on the TECHWR, FrameSGML, 
and framers lists because I believe it may be of general interest. I invite 
comments, particularly if you disagree with anything stated herein.
==============================================
I have examined the FM document you sent. It appears to be consistently 
tagged. The paragraph and character tagging scheme is quite simple, and 
reflects a relatively small number of document object types (e.g., body 
text, bulleted list, datafile). The Docbook DTD defines approximately 120 
different elements. My own opinion is that Docbook is the DTD from hell, 
and should be avoided at all costs unless you are being forced to use it.

1. CONVERSION TO STRUCTURED FM+SGML DOCUMENTS
Any automated method for converting unstructured FM documents to structured 
FM+SGML documents that conform to a DTD requires that the paragraph and 
character tags map unambiguously to elements in the DTD. Furthermore, an 
automated conversion cannot assign proper attribute values to elements in 
the converted docs (i.e., every value is initialized to its DTD-specified 
default, if any).

FM+SGML has a built-in capability to convert unstructured docs to 
structured ones, using structure rule tables to map the various tagged 
document objects in the unstructured doc to the corresponding SGML 
elements. When there is a 1:1 relationship (as opposed to 1-to-many or 
many-to-1) between each tagged object and its corresponding SGML element, 
structure rule tables can do a fairly good job. Even so, manual cleanup 
work is inevitable to make the converted document fully conformant to the 
DTD/EDD, and to apply the appropriate attribute values.

I conclude that your documents probably do not fit well with the above 
conversion requirements, particularly for conversion to a DTD/EDD as 
complex as Docbook. A more thorough analysis might show otherwise, however, 
particularly if you decide to develop your own DTD/EDD whose structure 
closely resembles that of your existing documents.

There is one additional requirement that must be met for unstructured to 
structured conversions to be possible: The entire FM document must have a 
single text flow.

Once you've converted an unstructured FM document to a structured FM+SGML 
one, you never again want to go back to the unstructured one for editing or 
anything else. After conversion, you should discard the original (first 
verifying, of course, that everything was properly converted).

2. VERSION CONTROL
You mention keeping the content of these documents (in .txt or .mif format) 
in CVS. Clearly, storing .txt or .mif is not the answer. Instead, you 
should export the documents from FM+SGML to XML and store that.

XML has many new features (including Unicode) that make it superior to SGML 
(and certainly cosmically better than ASCII text or MIF)  for database 
storage. Storage in this form has the added advantage of allowing you to 
maintain revision/version control at any desired level of granularity, 
because the proper kind of database repository can parse the document into 
its individual components (i.e., elements and external entities (e.g., 
graphics)),  maintain revision/version information on each component, and 
retrieve any desired portion of any desired version.
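
To make the idea of component-level revision tracking concrete, here is a 
minimal sketch that fingerprints each identified element of an XML document 
and reports which components differ between two versions. It is only an 
illustration, not how any particular repository product works; the "id" 
attribute name and the function names are assumptions made for the example.

```python
import hashlib
import xml.etree.ElementTree as ET


def element_digests(xml_text, id_attr="id"):
    """Map each element carrying an id attribute to a digest of its content.

    The attribute name "id" is an assumption; use whatever your DTD declares.
    The digest covers the serialized element, including its children and any
    text that immediately follows its end tag.
    """
    root = ET.fromstring(xml_text)
    digests = {}
    for el in root.iter():
        key = el.get(id_attr)
        if key is not None:
            data = ET.tostring(el, encoding="utf-8")
            digests[key] = hashlib.sha256(data).hexdigest()
    return digests


def changed_components(old_xml, new_xml):
    """Return ids of components that are new or differ in the new version."""
    old = element_digests(old_xml)
    new = element_digests(new_xml)
    return sorted(k for k in new if old.get(k) != new[k])


if __name__ == "__main__":
    v1 = '<doc><sect id="a"><p>one</p></sect><sect id="b"><p>two</p></sect></doc>'
    v2 = '<doc><sect id="a"><p>one</p></sect><sect id="b"><p>2</p></sect></doc>'
    print(changed_components(v1, v2))
```

A real repository would store and version the components themselves; the 
sketch only shows why parsed XML allows tracking below the whole-file level.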

A CVS/data repository that stores XML can become the sole source of 
controlled documents for an entire enterprise. Information is retrieved 
from the database by human and non-human queries. Middleware (e.g., 
Omnimark) is used to process the information extracted by these queries to 
match the requirements specified by the users. XSL style sheets (also part 
of the XML standard) can be created by the middleware to format the 
information when it is viewed in an XML-aware browser.

3. ROUND-TRIPPING BETWEEN THE CVS AND FM+SGML
Ideally, you would originate, revise, and edit your structured documents in 
the WYSIWYG environment of FM+SGML, export them as XML for storage in the 
database repository, and check the documents (or any portion thereof) 
directly out of the database into FM+SGML for incorporating changes, as 
well as for printing them or converting them to PDF or other formats. 
However, XML round-tripping is not possible because FM+SGML (including the 
new 6.0 version) can export XML but cannot import it. Consequently, if you 
export your documents as XML for storage in the database, you'll have to 
use a middleware product like OmniMark to convert the XML document 
instances to SGML before they can be imported into FM+SGML. This 
conversion from XML to SGML also requires that Unicode characters with 
code points above 127 (as well as any other non-English characters) be 
converted to their equivalent ISO character set entity references, since 
FM+SGML cannot process Unicode input.
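
The character conversion step can be sketched as follows. This is a minimal 
illustration, not what OmniMark or any other middleware actually does. It 
assumes that the Latin-1 entity names in Python's html.entities table match 
the ISO entity sets declared in your DTD (verify this before relying on it), 
and it falls back to numeric character references for everything else, which 
your SGML declaration may or may not accept. The function name is 
hypothetical.

```python
from html.entities import codepoint2name


def to_entity_references(text):
    """Replace every character above code point 127 with a reference.

    Assumption: names for code points 128-255 are taken from Python's HTML
    entity table and are presumed to match the DTD's ISO entity sets.
    Characters beyond that range become numeric character references.
    """
    out = []
    for ch in text:
        cp = ord(ch)
        if cp < 128:
            out.append(ch)                      # plain ASCII passes through
        elif cp < 256 and cp in codepoint2name:
            out.append("&" + codepoint2name[cp] + ";")
        else:
            out.append("&#%d;" % cp)            # decimal numeric reference
    return "".join(out)


if __name__ == "__main__":
    print(to_entity_references("caf\u00e9"))
    print(to_entity_references("\u4e2d"))
```

In practice the name table would be generated from the entity declarations 
your DTD actually includes, rather than borrowed from an HTML table.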

It is extremely unfortunate that FM+SGML (including the new version 6.0) 
does not implement Unicode. If Unicode  had been fully implemented, it 
would have been possible to use multi-language Unicode fonts with FM+SGML, 
which would have greatly facilitated language translations, including the 
intermixing of two or more languages in the same document. The intermixed 
languages would have been fully preserved on export to, or import from, XML.

4. LINK PROBLEMS
Another problem is links (i.e., cross-references and hypertext links). 
FrameMaker implements cross-reference links using ID and IDREF attributes 
that conform to the SGML standard. This is OK when all such links are 
internal to the exported SGML document instance, but external 
cross-references created in FM+SGML do not produce links that work when 
the document is exported to SGML, because FM+SGML, on export, cannot 
produce an IDREF attribute value that includes the location of the external 
file. (This is a limitation of SGML.) To make it worse, neither the 
internal nor the external cross-references work if the document is exported 
to XML, because links in XML are implemented differently, as specified in 
the XLink and XPointer portions of the XML standard. You could create 
XML-conformant equivalents of the ID and IDREF attributes in the FM+SGML 
EDD (and the corresponding DTD). However, these attribute values, unlike FM 
cross-references, will have to be manually entered for the elements at 
each end of each link, and the links will not work in FM+SGML.
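
One way to find the cross-references that will break is to check an exported 
instance for reference values that have no matching ID in the same file. The 
sketch below does this with Python's standard XML parser. Because that parser 
does not read attribute types from a DTD, the attribute names are passed in 
explicitly; "id" and "linkend" are the names Docbook uses, but treat them as 
assumptions for your own DTD. The function name and sample document are 
illustrative only.

```python
import xml.etree.ElementTree as ET


def unresolved_links(xml_text, id_attr="id", ref_attr="linkend"):
    """Return reference values that match no ID within the same document.

    The attribute names are assumptions; substitute the ID and IDREF
    attribute names declared in your DTD.
    """
    root = ET.fromstring(xml_text)
    ids = {el.get(id_attr) for el in root.iter()
           if el.get(id_attr) is not None}
    refs = [el.get(ref_attr) for el in root.iter()
            if el.get(ref_attr) is not None]
    # Anything left over points outside this file (or at nothing at all).
    return [r for r in refs if r not in ids]


SAMPLE = """<book>
  <chapter id="ch1">
    <para>See <xref linkend="ch2"/> and <xref linkend="appA"/>.</para>
  </chapter>
  <chapter id="ch2"><para>Text.</para></chapter>
</book>"""

if __name__ == "__main__":
    print(unresolved_links(SAMPLE))
```

Each value reported is a candidate external cross-reference that would need 
to be handled some other way after export.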

5. FORMATTING
You also mentioned that it would be nice to be able to preserve the "look 
and feel" of the existing unstructured documents after they've been 
converted to structured documents. This is where FM+SGML really shines. All 
of the formatting specifications are defined in the EDD and its companion 
template. Consequently, you can make the converted FM+SGML documents 
closely resemble the formatting of the current documents. Also, when you 
import an XML or SGML document instance into FM+SGML, the formatting 
specified in the EDD is applied.

When you export an XML document instance, you can also produce a Cascading 
Style Sheet (CSS) that is derived from the formatting specifications in the 
EDD and its companion template. Thus, if you open an exported XML document 
instance with a CSS in an XML-aware browser such as IE5, the formatting 
(but not necessarily the layout) of the original document will be replicated.

CONCLUSION
As you can see, conversion to structured FM+SGML documents is not a trivial 
undertaking, and the full utilization of  all the benefits that can be 
derived therefrom is made difficult by some of FM+SGML's current 
limitations. The initial investment is high, but if your operation is large 
enough, the savings possible in areas such as author productivity, document 
quality assurance, revision control, information reuse, and information 
repurposing  will pay back those costs many times over.




====================
| Nullius in Verba |
====================
Dan Emory, Dan Emory & Associates
FrameMaker/FrameMaker+SGML Document Design & Database Publishing
Voice/Fax: 949-722-8971 E-Mail: danemory@primenet.com
10044 Adams Ave. #208, Huntington Beach, CA 92646
---Subscribe to the "Free Framers" list by sending a message to
majordomo@omsys.com with "subscribe framers" (no quotes) in the body.


