SGML entity name algorithm

From: Markus G. Kuhn (kuhn@cs.purdue.edu)
Date: Fri Jul 25 1997 - 22:16:30 EDT


It would be very nice to have a new ISO 10646-1 annex that defines an
algorithm to generate the "ISO 10646-1:1998//ENTITIES UCS//EN" SGML entity
table from the Unicode names for ALL Unicode characters automatically.

I remember that about a year ago, someone posted a nice algorithm that
generates collision-free SGML entity names from the Unicode names fully
automatically. This algorithm did not use any large tables of the form

  LATIN CAPITAL LETTER Z WITH ACUTE -> Zacute

Standardized character mnemonics are used in many existing applications,
for instance TeX, SGML, and PostScript. So there seems to be a clear need for
well-defined ASCII mnemonics. At the moment, the Unicode standard does
not specify standard short names like those used in SGML. This could
be changed!

Having another huge table to define these names is clearly ugly. It has
already been demonstrated that it is possible to automatically generate
unique short character labels from the Unicode names with a very simple
algorithm. It is not easy to make the automatically generated mnemonics
100% identical to those in the SGML standard, so the new mnemonic list
will be slightly different (and more consistent!) in a few places.
But that is not a problem. The SGML standard is due for an update
anyway, and SGML can easily handle different entity tables in its
powerful object naming system, so people can continue to use the
old entity tables if they want.

One suggested difference in "ISO 10646-1:1998//ENTITIES UCS//EN" would be

  LATIN SMALL LETTER U WITH DIAERESIS -> udia (and not uuml)

to stay consistent with Unicode terminology. Many of the math
characters would also get different mnemonics.
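
To make the idea concrete, here is a rough sketch of what such a
table-free generator could look like. This is NOT the algorithm that
was posted a year ago (I do not have it at hand); the rules, the
function names short_name and find_collisions, and the "Uxxxx" fallback
are all my own assumptions for illustration. The only abbreviation
taken from this message is DIAERESIS -> dia. The sketch only handles
names of the form LATIN CAPITAL/SMALL LETTER X [WITH ...].

```python
import unicodedata

# Hypothetical abbreviation table. Only DIAERESIS -> "dia" comes from the
# proposal in this message; every other word is simply lowercased.
MARKS = {"DIAERESIS": "dia"}


def short_name(ch):
    """Derive a short ASCII mnemonic from the Unicode name of ch.

    Illustrative rules only (an assumption, not a standardized algorithm):
    LATIN CAPITAL LETTER Z WITH ACUTE -> Zacute, and so on. Anything that
    does not match the pattern falls back to "U" plus the hex code point.
    """
    words = unicodedata.name(ch, "").split()
    if (len(words) >= 4 and words[0] == "LATIN"
            and words[1] in ("CAPITAL", "SMALL") and words[2] == "LETTER"):
        # The case of the base letter follows CAPITAL/SMALL in the name.
        base = words[3] if words[1] == "CAPITAL" else words[3].lower()
        rest = words[4:]
        if not rest:
            return base
        if rest[0] == "WITH":
            marks = [w for w in rest[1:] if w != "AND"]
            return base + "".join(MARKS.get(w, w.lower()) for w in marks)
    # Fallback for every name the rules above do not cover.
    return "U%04X" % ord(ch)


def find_collisions(chars):
    """Return (first_char, second_char, name) for every duplicated name."""
    seen = {}
    clashes = []
    for ch in chars:
        name = short_name(ch)
        if name in seen:
            clashes.append((seen[name], ch, name))
        else:
            seen[name] = ch
    return clashes


if __name__ == "__main__":
    for ch in "\u0179\u00fc":
        print(unicodedata.name(ch), "->", short_name(ch))
```

Running find_collisions over a block of code points is a cheap way to
check whether a given set of rules really yields unique names, which is
the property any real proposal would have to guarantee.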

Does this sound interesting?

Markus

-- 
Markus G. Kuhn, Computer Science grad student, Purdue
University, Indiana, USA -- email: kuhn@cs.purdue.edu


