Re: Conversion of Word documents to structured frame documents




Dan Emory wrote:

> Thanks, Marcus for finally clearing this up. I'm still suspicious that
> the devil is in the details, however. I guess I'll just have to break down
> and get a copy of FM+SGML 5.5.6 to find out for myself.

It looks pretty straightforward, but I'll need to play with it a bit more as well.

> > I have written many filters to go from SGML to RTF, so my approach
> > would be different. I would save the SGML out of FrameMaker+SGML
> > and write a conversion that dealt with converting SGML conforming
> > to a specific DTD to RTF.
>
>
> But if you do it the way you describe above, there's no way to include the
> RTF formatting information (font definitions, style sheets, ad-hoc format
> overrides, etc.)

Most of that I would obtain by reverse engineering the Word document - the most
efficient way is to cut the top off a valid RTF file that adheres to the template
and then apply style definitions by number, but that's getting way off topic. What
you say is correct - the styling information would reside with the conversion code,
not the data itself. Whether that is good or bad could probably be argued validly
from either side.
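
A minimal sketch of the "cut the top off a valid RTF file" approach, in Python. Everything named here is an assumption for illustration: the header string stands in for whatever a real template would contain, and the element names, style numbers, and formatting control words are hypothetical. As far as I know, RTF readers treat \sN mostly as a label, so the formatting is repeated on each paragraph. Escaping of non-ASCII text is left out.

```python
# Hypothetical header, as if cut from the top of an RTF file saved from
# the Word template. It is deliberately left unclosed: the document body
# and the final "}" are appended after it.
RTF_HEADER = (
    r"{\rtf1\ansi\deff0"
    r"{\fonttbl{\f0 Times New Roman;}{\f1 Arial;}}"
    r"{\stylesheet{\s1\f1\fs32\b heading 1;}{\s2\f0\fs24 Body Text;}}"
)

# Hypothetical mapping: SGML element name -> (style number, formatting).
# This is where the styling information lives: in the conversion code,
# not in the data.
STYLE_FOR_ELEMENT = {
    "title": (1, r"\f1\fs32\b"),
    "para": (2, r"\f0\fs24"),
}


def escape_rtf(text):
    """Escape the three characters that are special in RTF text."""
    return "".join("\\" + ch if ch in "\\{}" else ch for ch in text)


def to_rtf(elements):
    """Convert (element name, text) pairs to an RTF document string."""
    parts = [RTF_HEADER]
    for name, text in elements:
        number, formatting = STYLE_FOR_ELEMENT[name]
        # Apply the style by number, then repeat its formatting.
        parts.append(
            "\\pard\\plain\\s%d%s %s\\par" % (number, formatting, escape_rtf(text))
        )
    parts.append("}")  # close the group opened by the header
    return "\n".join(parts)
```

Calling to_rtf([("title", "Intro"), ("para", "Some text")]) returns a string that can be written straight to a .rtf file. A real filter would take the element stream from an SGML parser rather than from a hand-built list.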

> After further analysis, it appears that SEMA's round-trip filters have the
> principal purpose of archival storage of unstructured Word (and other
> RTF-compatible WP products) documents in a neutral format (XML or SGML) that
> preserves the formatting information so that they can be recovered years
> later when the original WP is no longer available.

That's a great idea - kind of halfway between SGML and PostScript. It's inexpensive
and fairly future proof - when you need the data, you do an easy conversion into
your destination format.

> However, the SEMA rtf2rdc filter's preservation (in attributes) of the
> original font definitions, stylesheet, and document header, combined with
> the preservation (again in attributes) of any ad hoc format variations in
> XML/SGML paragraph and character style element instances, is a nice touch.

Yes, but if you were really looking for a long term and flexible storage model, it
might produce a DTD with defaulted attribute values so you could manage the dataset
globally, but that's getting more than just mildly ambitious...
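
For what it's worth, a rough sketch of what "defaulted attribute values" could mean in practice, in Python. It is not how rtf2rdc works; the attribute names (font, size) are hypothetical, every attribute is declared as CDATA for simplicity, and it assumes every instance of the element carries every attribute. The idea is to take the most common value of each attribute as the DTD default and keep only the exceptions on the instances.

```python
from collections import Counter


def factor_defaults(instances):
    """Split attribute dicts into (defaults, per-instance exceptions).

    The default for each attribute is its most common value. Assumes
    every instance carries every attribute; otherwise an instance that
    omitted one would silently pick up the default.
    """
    counts = {}
    for attrs in instances:
        for name, value in attrs.items():
            counts.setdefault(name, Counter())[value] += 1
    defaults = {name: c.most_common(1)[0][0] for name, c in counts.items()}
    slimmed = [
        {name: value for name, value in attrs.items() if defaults[name] != value}
        for attrs in instances
    ]
    return defaults, slimmed


def attlist(element, defaults):
    """Render an ATTLIST declaration carrying the computed defaults."""
    lines = ["<!ATTLIST %s" % element]
    lines += ['  %s CDATA "%s"' % (name, value) for name, value in defaults.items()]
    return "\n".join(lines) + ">"
```

Changing a default in the DTD then restyles every instance that did not override it, which is the "manage the dataset globally" part.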

> The fact that each paragraph (PARA) and character style (CS) element in the
> RTF-DOC DTD has attributes that identify the applicable stylesheet instance
> being used offers the opportunity to use SGML- or XML-aware tools to convert
> RTF-DOC document instances to more elaborate structures if:
>
> 1. The stylesheet names are indicative of the structure, AND

Yes, but content models such as:

<!ELEMENT chap     -  -  (sect+, para)>

<!ELEMENT sect     -  o  (para+)>

would require you to tag the last para in a chap with a style named something like
chap-para and the paras in a section would be style tagged as sect-para, even though
they may look exactly the same. It's a slightly contrived example, but the more
complex the style tagging scheme gets...

> 2. Consistent tagging was utilized during the preparation of the original WP
> document.

... the less likely this is to happen. I routinely borrow an old trick for beating a
lie detector test - when I'm going into a meeting that involves the conversion of
legacy Word documents, I position a thumb tack between my toes. Right when they say
"... but we use styles consistently", I jam the point into the underside of my big
toe to prevent me from laughing. It hurts a lot, but it prevents the otherwise
inevitable stony silence and glowering looks.
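
To make the chap/sect point above concrete, here is a small Python sketch. The dictionary is a hand-written simplification of the two content models (children only, no occurrence indicators), and the "parent-child" naming scheme is just the one used in the example above. It lists the context-qualified style names an author would have to apply for the structure to be recoverable from styles alone.

```python
def required_styles(content_models):
    """List one style name per (container, leaf element) pair.

    content_models maps each container element to its child element
    names. A child with no content model of its own is treated as a
    leaf, i.e. something an author tags with a paragraph style.
    """
    styles = []
    for parent, children in content_models.items():
        for child in children:
            if child not in content_models:
                styles.append("%s-%s" % (parent, child))
    return styles


def split_style(style_name):
    """Recover (parent, element) from a context-qualified style name."""
    parent, _, element = style_name.partition("-")
    return parent, element


# Simplified version of the chap/sect content models.
MODELS = {"chap": ["sect", "para"], "sect": ["para"]}
print(required_styles(MODELS))
```

Every additional container that can hold a para adds another style that looks identical on the page, and the author has to pick the right one every time.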

> Even when round-tripping is not a regular occurrence, it can still be a
> vital requirement. For example:
>
> 1. Source data is often created in Word, and must be converted to Frame
> without introducing errors or the need for extensive post-conversion clean-up.

This may be a valid reason, though I would have a difficult time believing that two
users created different documents exactly the same way, allowing for import free
from any intervention.

> 2. Legacy documents in Word need to be converted to Frame, particularly when
> an organization first acquires Frame.

I would never trust these documents to be consistent. If the objective is to get the
same dubious quality that existed in the Word documents into Frame it would work
fine, but most organisations would require a cleanup of the data anyway, otherwise
you can't even generate a reliable TOC.

> 3. An enterprise's tools for converting to on-line context-sensitive help
> (e.g., HTML Help, WinHelp, RoboHelp, etc.) may require that the input be in
> Word or RTF, necessitating error-free conversions of Frame docs to RTF or Word.

I'd be more inclined to purchase a new tool for on-line help than purchase something
that allowed me to use my old tool.

> 4. Documents created in Frame may have to be repurposed using Word (e.g.,
> training materials produced by other departments that don't use Frame).

This is a real requirement, though it doesn't involve round tripping. There are
already reputable tools for doing an Omni-directional conversion, (though their
names currently escape me :-).

> If the market for Frame products is to broaden, its
> capability for round-trip conversions to/from other formats must be expanded
> and improved. The most likely source of such conversion tools is third-party
> software vendors like SEMA, Omni Systems, Blueberry, Quadralay, and (in the
> case of SGML/XML) OmniMark.

That only alleviates the symptoms - it doesn't cure the illness. A better way is
for application designers to agree on (or be forced into) standardisation of an
interchange format. Adobe is there, and Microsoft have committed to support XML (IE5
does now). Although tools will always be required for legacy documents, the need for
new filters will surely diminish at a rate proportional to the uptake of XML.

> It now appears that Adobe plans to issue a major new release of Frame about
> once every two years (provided they can find a way to avoid bug-ridden point
> releases such as 5.5). Third-party software vendors of conversion tools are
> on a much shorter release schedule, because the nature of their business
> demands it. Adobe would be better off if it subsidized, or in other ways
> supported, those third-party vendors rather than trying to develop adequate
> conversion tools within the Frame product itself. Promotional deals could be
> struck that offered these third-party products at a deep discount to Frame
> license holders.

Adobe aren't interested in subsidising third party tools or improving their own
filters - they want to solve the problem. This might mean the eventual replacement
of MIF by XML as the principal non-binary data source - would that be such a bad
thing?


--
Regards,

Marcus Carr                      email:  mrc@allette.com.au
___________________________________________________________________
Allette Systems (Australia)      www:    http://www.allette.com.au
___________________________________________________________________
"Everything should be made as simple as possible, but not simpler."
       - Einstein


