Re: Repertoire, encoding, and representation (Was: Charsets + encoding + codesets)

From: John Cowan (cowan@drv.cbc.com)
Date: Tue Oct 07 1997 - 10:40:33 EDT


Kenneth Whistler wrote:

> The Unicode Standard talks about abstract characters. <a-acute> is an
> example of an abstract character in the Latin script. <d-dental-voiceless>
> is another example of an abstract character in the Latin script.

Well, this is very clear, and perhaps is the way things *should* be,
but I don't see that it's what the Standard says, and indeed, it appears
to me to directly contradict what the Standard says. To paraphrase
A.P. Herbert on Parliament: if the Unicode Consortium does not mean
what it says in the Unicode Standard, it must say so.

[much entirely correct stuff on combining character sequences
snipped]

> Keld is, of course, correct that the repertoire of abstract characters
> is open.

Unfortunately, this remark collides with the following statement
on page 3-4, which is presumptively normative:

        A Unicode abstract character is represented by a single
        Unicode code value; the only exception [sic] to this are
        surrogate pairs (which are provided for future extension,
        but are not currently used to represent any abstract
        characters).

(Perhaps this paragraph is not normative, but if so I don't
see how to tell what parts of Chapter 3 are not normative.)

So the term "coded character representation", which is defined as
"an ordered sequence of one or more code values which is associated
with an abstract character", can only refer (in Unicode 2.0) to a
single codepoint or two successive codepoints forming a surrogate
pair. It *cannot* refer to a combining character sequence,
because (in general) a combining character sequence is not
"represented by a single Unicode code value". <d-dental-voiceless>
is not so represented, and is not an abstract character.
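To make the distinction concrete, here is a minimal sketch in Python. It assumes the standard `unicodedata` module, which tracks a much later version of the Unicode database than 2.0, and it assumes that <d-dental-voiceless> would be spelled as d followed by COMBINING BRIDGE BELOW and COMBINING RING BELOW; that spelling is my guess for illustration, not something stated in this thread. The helper name `code_values` is likewise hypothetical.

```python
import unicodedata

# <a-acute>: available as one precomposed code value, or as a
# base letter followed by a combining mark.
precomposed = "\u00E1"        # LATIN SMALL LETTER A WITH ACUTE
sequence = "a\u0301"          # a + COMBINING ACUTE ACCENT

# <d-dental-voiceless> (assumed spelling): d + COMBINING BRIDGE BELOW
# + COMBINING RING BELOW. No precomposed code value exists for it.
d_seq = "d\u032A\u0325"

def code_values(s):
    """List the code values of a string in U+XXXX notation."""
    return [f"U+{ord(c):04X}" for c in s]

# Canonical composition folds the <a-acute> sequence into one code value...
a_composed = unicodedata.normalize("NFC", sequence)
# ...but leaves the <d-dental-voiceless> sequence as three code values.
d_composed = unicodedata.normalize("NFC", d_seq)

print(code_values(a_composed))
print(code_values(d_composed))
```

Under the page 3-4 wording, only the first of these could count as an abstract character, since only it is "represented by a single Unicode code value".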

In Unicode 2.0, then, the term "abstract character" is useful only
for separating assigned non-surrogate codepoints and assigned
surrogate pairs (the latter being currently an empty set),
which correspond to abstract characters,
from all other codepoints, including D800-DFFF,
which do not.
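The partition described here can be sketched as a small predicate. This is only an illustration: it relies on Python's `unicodedata` general categories from a current Unicode database, so it will not reproduce Unicode 2.0 counts, and excluding private-use code values is my own assumption about what "assigned" means. The function names are hypothetical.

```python
import unicodedata

def is_surrogate(cv: int) -> bool:
    """True for code values in the surrogate range D800-DFFF."""
    return 0xD800 <= cv <= 0xDFFF

def corresponds_to_abstract_character(cv: int) -> bool:
    """Strict page 3-4 reading: one assigned, non-surrogate code value."""
    if is_surrogate(cv):
        return False
    # Cn = unassigned or noncharacter, Co = private use (assumed excluded),
    # Cs = surrogate (already handled above, listed for completeness).
    return unicodedata.category(chr(cv)) not in ("Cn", "Co", "Cs")

for cv in (0x00E1, 0xD800, 0xFFFF):
    print(f"U+{cv:04X}", corresponds_to_abstract_character(cv))
```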

I conclude, therefore, that there are at present 38,885 abstract
characters in Unicode, all of which are represented by single
Unicode code values. I wish it were otherwise, but it is not.
To make it so, the offending paragraph of page 3-4 would
have to be rewritten somewhat as follows:

        A Unicode abstract character can be represented by
        a single Unicode code value, or by a single surrogate pair
        (no surrogate pair currently represents
        any abstract character), or by one of several Unicode
        code values which are canonically equivalent, or by a
        combining character sequence, or a sequence of Hangul
        jamos representing a single syllable, or by other means.
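The extra cases in this proposed wording can be illustrated with canonical normalization. The sketch below again assumes Python's `unicodedata` module (a later Unicode version than 2.0); the helper `same_abstract_character` is a hypothetical name, and treating NFC equality as "same abstract character" is a simplification for illustration only.

```python
import unicodedata

def same_abstract_character(a: str, b: str) -> bool:
    """Treat two representations as the same if they are canonically equivalent."""
    return unicodedata.normalize("NFC", a) == unicodedata.normalize("NFC", b)

# Several code values that are canonically equivalent:
angstrom = "\u212B"           # ANGSTROM SIGN
a_ring = "\u00C5"             # LATIN CAPITAL LETTER A WITH RING ABOVE

# A combining character sequence for the same letter:
a_ring_seq = "A\u030A"        # A + COMBINING RING ABOVE

# A sequence of Hangul jamos representing a single syllable:
jamo = "\u1112\u1161\u11AB"   # HIEUH + A + NIEUN
syllable = "\uD55C"           # HANGUL SYLLABLE HAN

print(same_abstract_character(angstrom, a_ring))
print(same_abstract_character(a_ring_seq, a_ring))
print(same_abstract_character(jamo, syllable))
```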

-- 
John Cowan	http://www.ccil.org/~cowan		cowan@ccil.org
			e'osai ko sarji la lojban


