
RE: Seeking SGML character entity to Unicode mapping/filter



AFAIK, there is no 100% official answer, if your question is:

  "How do I map ISO-8879:1986 character entities to Unicode,
   e.g., as hexadecimal numerical character references?"

Here are two almost-official resources:

  ftp://ftp.unicode.org/Public/MAPPINGS/VENDORS/MISC/SGML.TXT

and:

  http://www.docbook.org/xml/4.1.2/ent/

The file at the Unicode site appears to have lost end-of-lines totally, but
a bit of massage with a text editor should make it legible.

The character entity files at the DocBook site are very complete and may be
your best bet. You could rather easily set up a sed or perl script to do a
translation based on these files.
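
A minimal sketch of such a script, in Python rather than sed or perl. It
assumes the entity files contain declarations of the form
<!ENTITY zgr "&#x03B6;">, which is how I remember the DocBook XML entity
files; check the actual files before relying on it. The two sample
declarations and the function names are made up for illustration.
Declarations in any other form (e.g. decimal or doubly escaped values) are
simply skipped by this pattern.

```python
import re

# Assumed declaration form: <!ENTITY zgr "&#x03B6;">
ENTITY_DECL = re.compile(
    r'<!ENTITY\s+([A-Za-z][A-Za-z0-9.-]*)\s+"&#x([0-9A-Fa-f]+);"\s*>'
)
# A general entity reference such as &zgr;
ENTITY_REF = re.compile(r'&([A-Za-z][A-Za-z0-9.-]*);')


def load_entity_map(decl_text):
    """Return {entity name: code point} for every declaration matched."""
    return {name: int(hexval, 16)
            for name, hexval in ENTITY_DECL.findall(decl_text)}


def entities_to_ncr(text, entity_map):
    """Replace known entity references with hexadecimal character references.

    References that are not in the map are left untouched.
    """
    def repl(match):
        code_point = entity_map.get(match.group(1))
        if code_point is None:
            return match.group(0)
        return "&#x{:04X};".format(code_point)
    return ENTITY_REF.sub(repl, text)


# Illustrative sample only; in practice read the downloaded .ent files.
SAMPLE_DECLS = '''
<!ENTITY agr "&#x03B1;"> <!-- GREEK SMALL LETTER ALPHA -->
<!ENTITY zgr "&#x03B6;"> <!-- GREEK SMALL LETTER ZETA -->
'''

if __name__ == "__main__":
    table = load_entity_map(SAMPLE_DECLS)
    print(entities_to_ncr("small zeta: &zgr;", table))
```

Note that this is a blind textual substitution, so it also rewrites
references in places where they should be left alone.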


If you just want a quick-and-dirty table of Latin 1 (ISO-8859-1) vs. HTML 4
character entities, there's of course
http://www.ramsch.org/martin/uni/fmi-hp/iso8859-1.html.
To avoid common misconceptions about many of the Latin 1 characters, study
http://mirror.subotnik.net/jkorpela/latin1/index.html.
For a comprehensive coverage of the ISO-8859 alphabet soup, see
http://czyborra.com/charsets/iso8859.html.


To learn more about character code issues, see
http://mirror.subotnik.net/jkorpela/chars.html.
If you want to know more about Unicode, you could start at
http://www.cl.cam.ac.uk/~mgk25/unicode.html or http://www.unicode.org/.


If you want to play around with character sets, try
http://www.eki.ee/letter/.


In case you need to recode SGML or XML files, you may have to use a proper
parser in order to avoid recoding character data that shouldn't get recoded.
I have not tried this, but I would try James Clark's SGML-to-XML converter
'sx', http://www.jclark.com/sp/.
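
To illustrate the kind of data that should not get recoded, here is a rough
sketch that leaves CDATA marked sections and comments alone and recodes only
the rest. It is not a substitute for a real parser: other marked sections,
processing instructions and elements with CDATA declared content are not
handled. The function name and the 'recode' callback are hypothetical.

```python
import re

# Spans that must be passed through unchanged: CDATA sections and comments.
PROTECTED = re.compile(r'(<!\[CDATA\[.*?\]\]>|<!--.*?-->)', re.DOTALL)


def recode_outside_protected(text, recode):
    """Apply recode() to everything except CDATA sections and comments.

    re.split with one capturing group returns alternating pieces:
    even indexes are ordinary text, odd indexes are the protected spans.
    """
    parts = PROTECTED.split(text)
    return "".join(part if index % 2 else recode(part)
                   for index, part in enumerate(parts))


if __name__ == "__main__":
    # Stand-in for a real entity translation function.
    def demo(chunk):
        return chunk.replace("&zgr;", "&#x03B6;")

    print(recode_outside_protected(
        "a &zgr; <![CDATA[ &zgr; ]]> <!-- &zgr; --> &zgr;", demo))
```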


In case you have an OmniMark license or still have one of the free
versions, there's the Unimap script at http://www.xmeta.com/omlette/.


For general recoding tasks, there's a nifty utility called
'recode', http://www.iro.umontreal.ca/contrib/recode/HTML/index.html. It
does not (yet?) recode from ISO-8879 (SGML) character entities. You can,
however, recode from HTML 4 character entities to Unicode (e.g. UTF-8) like
this:

  recode h4..u8 < inputfile > outputfile

You can also get a table of the HTML 4 characters in Unicode like this:

  recode h4/test8..dump < /dev/null

and a lot more that you'll never ever need.

recode is available in most Unix/Linux/*BSD distributions. It is also ported
to Win32, http://www.weihenstephan.de/~syring/win32/UnxUtils.html, and
compiles out-of-the-box on Cygwin, http://sources.redhat.com/cygwin/.
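
Where recode is not at hand, something close to the h4..u8 conversion can be
sketched with Python's standard library. This is an approximation, not a
drop-in replacement: html.unescape recognizes the HTML 5 entity names (which
include the HTML 4 ones) and is more lenient than recode, e.g. it accepts
some references without the trailing semicolon. The function name is made up
for illustration.

```python
import html


def html_entities_to_utf8(text):
    """Expand HTML character references and return the result as UTF-8 bytes."""
    return html.unescape(text).encode("utf-8")


if __name__ == "__main__":
    # Named, decimal and hexadecimal references are all expanded.
    print(html_entities_to_utf8("caf&eacute; &zeta; &#950; &#x3B6;"))
```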




Kind regards,
Peter Ring



-----Original Message-----
From: owner-framers@omsys.com [mailto:owner-framers@omsys.com]On Behalf
Of Erica Chapin
Sent: Friday, 13 April, 2001 7:33 PM
To: Jason Aiken; framers@omsys.com
Subject: RE: Seeking SGML character entity to Unicode mapping/filter


I would also be interested in this info, so if it
is not too specialized to be of general interest,
perhaps it could go to the whole list - or -
please include me on your reply to Jason.

and a good friday it is!
Erica

> -----Original Message-----
> From: owner-framers@omsys.com
> [mailto:owner-framers@omsys.com]On Behalf
> Of Jason Aiken
> Sent: Friday, April 13, 2001 10:07
> To: framers@omsys.com
> Subject: Seeking SGML character entity
> to Unicode mapping/filter
>
>
> Greetings FrameMuddlahs,
>
> I'm wondering if any of you SGML-crazed
> Unicode fanatics blessed with the
> pleasure of publishing in 10+ languages
> might have any idea where to look for a
> complete table or filter mapping SGML
> character entities to appropriate
> Unicode values. If you don't know what
> I'm talking about, consider yourself
> lucky. If you do, you know that
> Unicode-friendly tools are kinda nice (ahem!).
>
> For example, to map small zeta for
> Greek, I'd need a Unicode value for &zgr;.
>
> Surely someone has gone through the
> pain of making SGML applications work
> for Unicode stuff?
>
> Any insults, commiserations, scathing
> grammatical criticisms, cryptic clues,
> URLs, or actual assistance is deeply
> appreciated.
>
> Maybe I should go post this on the
> Adobe User-to-User forum for FM+SGML...
>
> Happy Good Friday the Thirteenth (muahahahah),
> Jason
>
>
> ** To unsubscribe, send a message to
> majordomo@omsys.com **
> ** with "unsubscribe framers" (no
> quotes) in the body.   **
>


** To unsubscribe, send a message to majordomo@omsys.com **
** with "unsubscribe framers" (no quotes) in the body.   **

