
Structured Document Design for XML or SGML



STRUCTURED AND UNSTRUCTURED DOCUMENTS ARE
THE SAME AT THE "ATOMIC" LEVEL

At this lowest level there are only document object types (e.g., paragraphs,
text ranges within paragraphs, graphics, equations, tables, cross-references,
markers) and sub-types (e.g., text paragraph, bulleted paragraph, numbered
list paragraph, section head paragraph, figure caption, table caption,
bolded or italicized text range, index marker), most of which, at least
in FrameMaker, are represented by descriptive tags.

Now, SGML purists would argue that, in a structured document, these "atomic"
object types and sub-types must be assigned names that describe their
content. Thus, if there are 25 content types, there would have to be 25 element
names for text paragraphs, 25 names for figure captions, 25 names for
bulleted paragraphs, and so on (and on and on and on, reductio ad absurdum).

I contend that this is not only unnecessary but also self-defeating. Elements
at the "atomic" level should be given names that describe their
objectness (i.e., object type/subtype), which is distinctly different from
formatting information. For example, Bullet_Item describes a paragraph
of sub-type bulleted item. If there is a compelling need (unlikely) to describe
the content type at this low level, then it should be done by assigning one
or more attributes for that purpose.
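
As a rough illustration of this naming approach, the Python sketch below
builds a Bullet_Item whose name states only its object type, while content
facets ride along as attributes. The attribute names (ContentType, Audience),
their values, and the helper has_content_type are hypothetical, invented for
this example; they do not come from any actual DTD.

```python
import xml.etree.ElementTree as ET

# Object-named element: the name says what kind of document object this is.
item = ET.Element("Bullet_Item", {
    "ContentType": "Warning",   # hypothetical content attribute
    "Audience": "Operator",     # hypothetical content attribute
})
item.text = "Disconnect power before opening the case."

def has_content_type(elem, wanted):
    """Return True if the element's (hypothetical) ContentType matches."""
    return elem.get("ContentType") == wanted

# A process that cares about content filters on attributes, not element names.
print(ET.tostring(item, encoding="unicode"))
print(has_content_type(item, "Warning"))
```

Adding another content facet later means adding an attribute, not a new
family of element names.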

Incidentally, one of the odd things about SGML purists is the way they
cling to the idea that a single (usually cryptic) element name is sufficient
to describe its content. Usually, content has many different facets.
It makes more sense (to me at least) to provide attributes for this purpose.
Not only does this approach to describing content make more sense,
it also makes the DTD much simpler and less vulnerable to the impact
of evolving technologies and processes.

STRUCTURED DOCUMENTS BEGIN TO DIFFER FROM
UNSTRUCTURED ONES AT THE "MOLECULAR" LEVEL
Here, groups of "atomic" elements are wrapped in containers.
For example, a sequence of Bullet_Item elements would be wrapped
in a BulletList container, the elements that compose a Figure
(e.g., a Graphic element preceded or followed by a Figure_Caption
element) would be wrapped in a Figure container, and so on.
Although there are exceptions, most molecular-level container elements
of the types I'm describing here are actually "super objects"
that ought to also be given names that describe their objectness, not their
content. If necessary at this level, attributes should be used to
describe content.
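
The wrapping step described here can be sketched in Python. This is only a
minimal illustration, assuming the atomic elements already sit as siblings
under some parent; the function name wrap_runs and the Section parent element
are hypothetical, while Bullet_Item, BulletList, and Para are the names used
in this post.

```python
import xml.etree.ElementTree as ET

def wrap_runs(parent, item_tag, container_tag):
    """Wrap each run of consecutive item_tag children in a container_tag element."""
    regrouped = []
    container = None
    for child in list(parent):
        if child.tag == item_tag:
            if container is None:
                # Start a new container at the first item of a run.
                container = ET.Element(container_tag)
                regrouped.append(container)
            container.append(child)
        else:
            # Any other element ends the current run.
            container = None
            regrouped.append(child)
    # Replace the old flat children with the regrouped ones.
    for child in list(parent):
        parent.remove(child)
    parent.extend(regrouped)
    return parent

doc = ET.fromstring(
    "<Section><Para>Intro</Para>"
    "<Bullet_Item>One</Bullet_Item><Bullet_Item>Two</Bullet_Item>"
    "<Para>Outro</Para></Section>"
)
wrap_runs(doc, "Bullet_Item", "BulletList")
print([child.tag for child in doc])  # ['Para', 'BulletList', 'Para']
```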

THE ADVANTAGES OF UNIVERSAL BUILDING BLOCKS
The atomic and molecular elements described so far are the
universal building blocks of any structured document, no matter what
variations in content-oriented superstructure are imposed by different
DTDs. Ideally, everyone would agree on definitions and naming conventions
for them so that this core element set could become common to all
future DTDs. That would yield the following benefits:

1. The conversion of vast libraries of unstructured legacy documents to
structured ones would be greatly simplified. The first step is always the
conversion of tagged document objects to this core set of elements.
In the next pass, the core elements are wrapped in the superstructure
of content-named elements peculiar to each document type.

2. The conversion of structured documents (or portions thereof)
from one DTD to another would also be greatly simplified.

3. A quantum jump in authoring productivity would result, because,
at the atomic and molecular levels of structure, authors think in terms
of document objects, not content. If the same element naming conventions
for these atomic and molecular elements were used in all DTDs, the learning
time needed to become proficient with a new DTD could be substantially reduced.
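
The first conversion pass mentioned in benefit 1 might look something like
the Python sketch below. It is an illustration under stated assumptions, not
a description of any real conversion tool: the legacy tag names (Body,
Bulleted, FigCaption), the Document root, the LegacyTag attribute, and the
function to_core_elements are all hypothetical. The second pass, which wraps
the core elements in a DTD-specific superstructure, is not shown.

```python
import xml.etree.ElementTree as ET

# Hypothetical mapping from legacy paragraph tags to core element names.
TAG_TO_CORE = {
    "Body": "Para",
    "Bulleted": "Bullet_Item",
    "FigCaption": "Figure_Caption",
}

def to_core_elements(tagged_paragraphs, root_tag="Document"):
    """Pass 1: turn (tag, text) pairs into a flat sequence of core elements."""
    root = ET.Element(root_tag)
    for tag, text in tagged_paragraphs:
        core = TAG_TO_CORE.get(tag)
        if core is None:
            # Unmapped tags fall back to Para and keep the old tag for review.
            elem = ET.SubElement(root, "Para", {"LegacyTag": tag})
        else:
            elem = ET.SubElement(root, core)
        elem.text = text
    return root

legacy = [("Body", "Intro text"), ("Bulleted", "First point"), ("Odd", "Stray")]
core_doc = to_core_elements(legacy)
print(ET.tostring(core_doc, encoding="unicode"))
```

Because the mapping targets only the core set, the same first pass could in
principle be reused no matter which DTD the second pass targets.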

USING ATTRIBUTES TO SPECIFY FORMATTING
Element context alone is usually not enough to define formatting. In my
EDD/DTD designs, I use formatting attributes at all levels of structure, and
the combination of element context and attribute values determines the
formatting.

For example, formatting attributes for the ubiquitous Para element
might include:
ParaStyle Attribute
   Plain (default)
   Bold
   Italics
   Underlined
   Message (uses Courier font)
TextSize Attribute
   Large (2 points larger than regular).
   Regular--the font size in the default paragraph format (default).
   Small (2 points smaller than regular).
Width Attribute
   Across All Columns--text spans the sidehead and normal text columns.
   Normal--the text appears in the normal text column (default).
Alignment Attribute
   Left (default)
   Centered
   Right
TblCellVertAlign Attribute--for Para elements contained in a table cell,
specifies the vertical alignment within the cell.
   Top (default)
   Middle
   Bottom

I know this approach gives SGML purists fits, but it allows the author to
deploy a single element named Para in virtually any context where a text
paragraph is needed.
This approach, at least to me, makes more sense than using processing
instructions or other abstruse techniques to specify formatting.
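
To make the combination of context and attribute values concrete, here is a
small Python sketch that resolves formatting for a Para element. It is not
how FrameMaker or an EDD actually evaluates format rules; the function
resolve_format, the 10-point base size, the output keys, and the "default"
font name are assumptions made for illustration. The attribute names, values,
and defaults are those listed above.

```python
# Defaults taken from the attribute list above.
DEFAULTS = {
    "ParaStyle": "Plain",
    "TextSize": "Regular",
    "Width": "Normal",
    "Alignment": "Left",
    "TblCellVertAlign": "Top",
}

# Large and Small are 2 points above and below the regular size.
SIZE_OFFSET = {"Large": 2, "Regular": 0, "Small": -2}

def resolve_format(attrs, in_table_cell=False, base_size=10):
    """Combine attribute values with element context (assumed base_size of 10 points)."""
    merged = {**DEFAULTS, **attrs}
    fmt = {
        "style": merged["ParaStyle"],
        # Message paragraphs use Courier; "default" is a placeholder font name.
        "font": "Courier" if merged["ParaStyle"] == "Message" else "default",
        "size": base_size + SIZE_OFFSET[merged["TextSize"]],
        "width": merged["Width"],
        "alignment": merged["Alignment"],
    }
    # Context matters: vertical alignment applies only inside a table cell.
    if in_table_cell:
        fmt["vertical_alignment"] = merged["TblCellVertAlign"]
    return fmt

print(resolve_format({"TextSize": "Large", "Alignment": "Centered"}))
print(resolve_format({"TblCellVertAlign": "Middle"}, in_table_cell=True))
```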

FOR MORE INFORMATION
I have a 23-page PDF paper on this subject that elaborates on the issues
discussed here. To get it, go to:

http://www.microtype.com
When the page opens, click on Resources in the frame at the left.
Scroll down to Links to Tutorials and articles. You'll find it, as
well as 4 or 5 other papers that I've written.



====================
| Nullius in Verba |
====================
Dan Emory, Dan Emory & Associates
FrameMaker/FrameMaker+SGML Document Design & Database Publishing
Voice/Fax: 949-722-8971 E-Mail: danemory@primenet.com
10044 Adams Ave. #208, Huntington Beach, CA 92646
---Subscribe to the "Free Framers" list by sending a message to
majordomo@omsys.com with "subscribe framers" (no quotes) in the body.


