To: Dan Emory <danemory@xxxxxxxxxxxx>
Subject: Re: Conversion of Word documents to structured frame documents
From: Marcus Carr <mrc@xxxxxxxxxxxxxx>
Date: Fri, 09 Apr 1999 12:46:05 +1000
Organization: Allette Systems (Australia)
References: <2.2.16.19990408141824.243f462a@pop.primenet.com>
Sender: owner-framers@xxxxxxxxx
Dan Emory wrote:

> Thanks, Marcus for finally clearing this up. I'm still suspicious that
> the devil is in the details, however. I guess I'll just have to break down
> and get a copy of FM+SGML 5.5.6 to find out for myself.

It looks pretty straightforward, but I'll need to play with it a bit more as well.

> > I have written many filters to go from SGML to RTF, so my approach
> > would be different. I would save the SGML out of FrameMaker+SGML
> > and write a conversion that dealt with converting SGML conforming
> > to a specific DTD to RTF.
>
> But if you do it the way you describe above, there's no way to include the
> RTF formatting information (font definitions, style sheets, ad-hoc format
> overrides, etc.)

Most of that I would obtain by reverse engineering the Word document - the most efficient way is to cut the top off a valid RTF file that adheres to the template and then apply style definitions by number, but that's getting way off topic. What you say is correct - the styling information would reside with the conversion code, not the data itself. Whether that is good or bad could probably be argued validly from either side.

> After further analysis, it appears that SEMA's round-trip filters have the
> principal purpose of archival storage of unstructured Word (and other
> RTF-compatible WP products) documents in a neutral format (XML or SGML) that
> preserves the formatting information so that they can be recovered years
> later when the original WP is no longer available.

That's a great idea - kind of halfway between SGML and PostScript. It's inexpensive and fairly future proof - when you need the data, you do an easy conversion into your destination format.
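For anyone who wants the "styling lives in the conversion code" approach spelled out, here is a minimal Python sketch of it. Everything specific in it is an assumption made for illustration: the element names (title, para), the style numbers, and the header string, which stands in for the top cut off a real RTF file saved from the target template. It also uses Python's XML parser as a stand-in for an SGML parser, so the input has to be well-formed XML with ASCII text. The formatting codes are repeated alongside each style number because RTF readers generally expect that; the number alone only names the style.

```python
import xml.etree.ElementTree as ET

# Hypothetical header, standing in for the "top" cut off a valid RTF file
# that was saved from the target Word template.
RTF_HEADER = (
    r"{\rtf1\ansi\deff0"
    r"{\fonttbl{\f0 Times New Roman;}}"
    r"{\stylesheet{\s1\b\fs32 Heading 1;}{\s2\fs22 Body Text;}}"
)

# Element name -> (style number, formatting copied from the stylesheet entry).
# This table is where the styling information lives: in the conversion code,
# not in the data.
STYLE_MAP = {
    "title": (1, r"\b\fs32"),
    "para": (2, r"\fs22"),
}


def escape_rtf(text):
    """Escape the three characters that are special in RTF (ASCII input assumed)."""
    return "".join("\\" + ch if ch in "\\{}" else ch for ch in text)


def element_to_rtf(elem):
    """Emit one RTF paragraph that applies the mapped style by number."""
    style_num, fmt = STYLE_MAP[elem.tag]
    text = escape_rtf("".join(elem.itertext()).strip())
    return rf"\pard\plain\s{style_num}{fmt} {text}\par"


def convert(xml_text):
    """Convert the direct children of the root element to an RTF document."""
    root = ET.fromstring(xml_text)
    body = "\n".join(
        element_to_rtf(child) for child in root if child.tag in STYLE_MAP
    )
    return RTF_HEADER + "\n" + body + "\n}"


if __name__ == "__main__":
    print(convert("<doc><title>Intro</title><para>Some {text}</para></doc>"))
```

A real filter would be written against one specific DTD and would have to handle nesting, inline elements, and non-ASCII characters, none of which this sketch attempts.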

> However, the SEMA rtf2rdc filter's preservation (in attributes) of the
> original font definitions, stylesheet, and document header, combined with
> the preservation (again in attributes) of any ad hoc format variations in
> XML/SGML paragraph and character style element instances, is a nice touch.

Yes, but if you were really looking for a long term and flexible storage model, it might produce a DTD with defaulted attribute values so you could manage the dataset globally, but that's getting more than just mildly ambitious...

> The fact that each paragraph (PARA) and character style (CS) element in the
> RTF-DOC DTD has attributes that identify the applicable stylesheet instance
> being used offers the opportunity to use SGML- or XML-aware tools to convert
> RTF-DOC document instances to more elaborate structures if:
>
> 1. The stylesheet names are indicative of the structure, AND

Yes, but content models such as:

    <!ELEMENT chap  - - (sect+, para)>
    <!ELEMENT sect  - o (para+)>

would require you to tag the last para in a chap with a style named something like chap-para and the paras in a section would be style tagged as sect-para, even though they may look exactly the same. It's a slightly contrived example, but the more complex the style tagging scheme gets...

> 2. Consistent tagging was utilized during the preparation of the original WP
> document.

... the less likely this is to happen. I routinely borrow an old trick for beating a lie detector test - when I'm going into a meeting that involves the conversion of legacy Word documents, I position a thumb tack between my toes. Right when they say "... but we use styles consistently", I jam the point into the underside of my big toe to prevent me from laughing. It hurts a lot, but it prevents the otherwise inevitable stony silence and glowering looks.

> Even when round-tripping is not a regular occurrence, it can still be a
> vital requirement. For example:
>
> 1. Source data is often created in Word, and must be converted to Frame
> without introducing errors or the need for extensive post-conversion clean-up.

This may be a valid reason, though I would have a difficult time believing that two users created different documents exactly the same way, allowing for import free from any intervention.

> 2. Legacy documents in Word need to be converted to Frame, particularly when
> an organization first acquires Frame.

I would never trust these documents to be consistent. If the objective is to get the same dubious quality that existed in the Word documents into Frame it would work fine, but most organisations would require a cleanup of the data anyway, otherwise you can't even generate a reliable TOC.

> 3. An enterprise's tools for converting to on-line context-sensitive help
> (e.g., HTML Help, WinHelp, RoboHelp, etc.) may require that the input be in
> Word or RTF, necessitating error-free conversions of Frame docs to RTF or Word.

I'd be more inclined to purchase a new tool for on-line help than purchase something that allowed me to use my old tool.

> 4. Documents created in Frame may have to be repurposed using Word (e.g.,
> training materials produced by other departments that don't use Frame).

This is a real requirement, though it doesn't involve round tripping. There are already reputable tools for doing an Omni-directional conversion, (though their names currently escape me :-).

> If the market for Frame products is to broaden, its
> capability for round-trip conversions to/from other formats must be expanded
> and improved. The most likely source of such conversion tools is third-party
> software vendors like SEMA, Omni Systems, Blueberry, Quadralay, and (in the
> case of SGML/XML) OmniMark.

That only alleviates the symptoms - it doesn't cure the illness. A better way is for application designers to agree on (or be forced into) standardisation of an interchange format.
Adobe is there, and Microsoft have committed to support XML (IE5 does now). Although tools will always be required for legacy documents, the need for new filters will surely diminish at a rate proportional to the uptake of XML.

> It now appears that Adobe plans to issue a major new release of Frame about
> once every two years (provided they can find a way to avoid bug-ridden point
> releases such as 5.5). Third-party software vendors of conversion tools are
> on a much shorter release schedule, because the nature of their business
> demands it. Adobe would be better off if it subsidized, or in other ways
> supported, those third-party vendors rather than trying to develop adequate
> conversion tools within the Frame product itself. Promotional deals could be
> struck that offered these third-party products at a deep discount to Frame
> license holders.

Adobe aren't interested in subsidising third party tools or improving their own filters - they want to solve the problem. This might mean the eventual replacement of MIF by XML as the principal non-binary data source - would that be such a bad thing?

--
Regards,

Marcus Carr                      email:  mrc@allette.com.au
___________________________________________________________________
Allette Systems (Australia)      www:    http://www.allette.com.au
___________________________________________________________________
"Everything should be made as simple as possible, but not simpler."
       - Einstein

** To unsubscribe, send a message to majordomo@omsys.com **
** with "unsubscribe framers" (no quotes) in the body. **