Re: The Atomic Theory of Unicode

From: Jonathan Coxhead (jonathan@doves.demon.co.uk)
Date: Sat Jul 10 1999 - 00:55:22 EDT


   Ken Whistler wrote,

 | Jonathan Coxhead has supplied a long document introducing his
 | Atomic Theory of Unicode.
 |
 | [...]

   Thank you for taking the time to write such a gracious response,
and also to John Cowan. The factual corrections you pointed out have
been incorporated into the document, which can be found at
<http://www.doves.demon.co.uk/atomic.html>. I've also included the
longest English word you can write upside-down in Unicode:
'aftereffect'.

   I have a few general points to make ...

 | In almost every case consideration was given to encoding specific
 | characters to avoid having to encode enumerated lists of variant
 | forms of characters already in the basic repertoire. There were
 | extensive discussions of this, and what emerged was the consensus
 | agreement about where to draw the lines between generativity (through
 | separate encoding of combining marks) and enumeration (of small--or
 | even large--defective sets of variants).

   I recognise this: the principles you refer to are clearly visible to
a reader of _The_Unicode_Standard_, and they make a lot of sense: in a
nutshell (and without meaning to offend), "genuine" characters together
with compatibility extensions which are present for political and
commercial reasons.

   I wouldn't have thought of trying to analyse the various glyph
variants, characters using different fonts, ways of turning and
reflecting characters, etc, except for one thing: the proposal for
"mathematics in plain text". The entirety of the Latin alphabet is
going to be coded in italic, bold, open-face (etc) variants, with the
intention of using them as mathematical symbols. It seems to me to be
*inevitable* that if these characters are available, they will be used
in non-mathematical text. (I'd have used italics for 'inevitable', for
example, even if the spacing was a bit wrong.) "Rich text" will have
entered Unicode by the back door, and in a particularly inelegant way.

   Where it's possible to defend DOUBLE-STRUCK CAPITAL R as a symbol
intended only for use in mathematics when it is 1 of just 7 similar
symbols, the argument falls down when the whole double-struck alphabet
is coded, and with the explicit intention that it be used as such. Who
can put their hand on their heart and claim that these are not glyph
variants? (If anyone tries it, ask them why 'R' is used to represent
the set of real numbers. ---Because the word 'real' starts with 'r'!)
How then can 'n' encodings of Latin be justified when the difference
between Chinese, Japanese and Korean characters is described as one of
font only, and not encoded?
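   (As an aside, the character database itself records the relationship
this argument rests on: DOUBLE-STRUCK CAPITAL R carries a <font>
compatibility decomposition back to plain LATIN CAPITAL LETTER R, and
the mathematical bold alphabet does the same. A quick check with
Python's unicodedata module, used here purely as an illustration and
not as anything assumed by this discussion, shows it:)

```python
import unicodedata

# U+211D DOUBLE-STRUCK CAPITAL R: the database tags it as a <font>
# variant of plain U+0052 LATIN CAPITAL LETTER R ...
print(unicodedata.decomposition('\u211d'))      # <font> 0052

# ... so compatibility normalization folds it back to 'R'.
print(unicodedata.normalize('NFKC', '\u211d'))  # R

# U+1D411 MATHEMATICAL BOLD CAPITAL R gets exactly the same treatment.
print(unicodedata.decomposition('\U0001d411'))  # <font> 0052
```

   (In other words: the standard says "font variant of R" in its own
data, even while encoding the character separately.)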

   However, this is a rhetorical question only, as the decision has
already been made. Except that ...

 | After that, it will simply be the normal
 | business of encoding the occasional oversight or oddball historic
 | character that someone turns up.

... and I am not convinced that mathematical characters will not continue
to appear in a productive way, as I described for EQUAL TO BY
DEFINITION, or that similar things will not happen in other fields
(including, of course, general culture and advertising).

 | The term "Atomic Unicode" (aka "Cleanicode") has also been around for
 | nearly a decade, referring to Unicode as it "ought" to have been,
 | without any precomposed Latin, Greek, or Cyrillic, without piles of
 | compatibility characters of various styles, without ligatures, etc.,
 | etc. But while Cleanicode has had advocates, it has never gotten off
 | the ground because it is so patently obvious to the implementers that
 | they need all the other stuff in the standard to deal with the rest
 | of the data in the world in reasonably straightforward ways.

   Is there any documentation on this? An AltaVista search yields 1
page, written before 1992. I had no idea I was reusing an existing
term.

 | Since I am opposed to this entire proposal in principle, I am not
 | going to argue the details of each individual character Mr. Coxhead
 | proposes as part of the analysis. I hope my response is taken as due
 | consideration, despite the fact that I will not engage further in the
 | details.

   Absolutely. I have no further axe to grind---I realised I was not
acting within the mainstream of Unicode development, but I wanted to
make the best case I could for the ideas I had. If those ideas are not
compelling to anyone else, then that's that, really! (And I have
certainly not been inundated with fan mail as a consequence. :-)

   A different case could be made: for a hybrid approach where only the
compatibility decompositions would be changed, and they would be
drawn from the extended set <black-letter> <capital letter tone>
<circle> <compose> <curl> <descender> <double-struck> <double>
<fraction> <heavy> <hook> <inverted> <italic> <large> <left-to-right>
<ligature> <line below> <line overlay> <narrow> <outlined> <palatalized
hook> <plinthed> <retroflex hook> <reversed> <ring> <rotated> <sans
serif> <script> <shadowed> <small letter tone> <small> <solidus>
<square> <stack down> <stack up> <stroke> <sub> <super> <triple>
<turned> <variant> <white> <wide>, which would *not* be encoded as
characters, and so not be available as productive units.

   This would at least allow "the story" of each character---for
example, the decomposition <palatalized hook> + LATIN SMALL LETTER T
for \u01AB---to appear in some form in the standard.
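   (To make the contrast concrete: the published standard records no
decomposition at all for \u01AB, so its "story" is simply absent,
whereas the story of, say, the fi ligature is preserved under a
<compat> tag. A quick check with Python's unicodedata module, again
only as an illustration, shows the difference:)

```python
import unicodedata

# U+01AB is atomic in the published standard: no decomposition is
# recorded, so nothing connects it back to LATIN SMALL LETTER T.
print(unicodedata.name('\u01ab'))                 # LATIN SMALL LETTER T WITH PALATAL HOOK
print(repr(unicodedata.decomposition('\u01ab')))  # ''

# By contrast, U+FB01 LATIN SMALL LIGATURE FI does carry its story,
# under the <compat> tag: f + i.
print(unicodedata.decomposition('\ufb01'))        # <compat> 0066 0069
```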

   However, I'm done. "Thank you, and good night."

        /|
 o o o (_|/
        /|
       (_/



This archive was generated by hypermail 2.1.2 : Tue Jul 10 2001 - 17:20:48 EDT