CCS and CEF definitions in UTR #17

From: Mike Brown (mbrown@corp.webb.net)
Date: Fri Apr 21 2000 - 21:12:58 EDT


Sorry for not catching this last month when this thread was topical.

Keld Jørn Simonsen wrote:
> The specific codes for UTF-16 extension into plane 1-16
> is not allowed in UCS-4 (or in UTF-8 for that matter).

I'm trying to sort out a table of Unicode scalar values and their
corresponding UTF-16, UCS-4, and UCS-2 code value sequences. After
re-reading Keld's statement and finding some supportive evidence in the
UTF-16 amendment, I'm getting more confused, especially after consulting UTR
#17.
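
To make the table concrete, here is a minimal sketch of how one row might be
computed. The function names (utf16_units, fixed_width_unit, table_row) are
made up for illustration. The sketch uses the surrogate-pair arithmetic from
the UTF-16 amendment and treats UCS-4 and UCS-2 as plain identity mappings of
the integer. Whether UCS-4 and UCS-2 should really produce a value for the
surrogate range is exactly the open question, so the sketch does not try to
decide it.

```python
def utf16_units(scalar):
    """Return the UTF-16 code units for an integer, or None for a surrogate value."""
    if not 0 <= scalar <= 0x10FFFF:
        raise ValueError("outside the range reachable by UTF-16")
    if 0xD800 <= scalar <= 0xDFFF:
        return None  # a lone surrogate value has no UTF-16 sequence of its own
    if scalar < 0x10000:
        return [scalar]  # plane 0: one 16-bit code unit
    offset = scalar - 0x10000  # planes 1-16: split 20 bits into two halves
    return [0xD800 + (offset >> 10), 0xDC00 + (offset & 0x3FF)]


def fixed_width_unit(scalar, limit):
    """Identity mapping used here for UCS-4 and UCS-2 (an assumption of this sketch)."""
    return [scalar] if scalar <= limit else None


def fmt(units, width):
    """Format a list of code units as hex digits, or mark the cell as empty."""
    if units is None:
        return "(none)"
    return " ".join(f"{u:0{width}X}" for u in units)


def table_row(scalar):
    """Build one row: scalar value, then UTF-16, UCS-4 and UCS-2 sequences."""
    return "  ".join([
        f"U-{scalar:08X}",
        "UTF-16: " + fmt(utf16_units(scalar), 4),
        "UCS-4: " + fmt(fixed_width_unit(scalar, 0x7FFFFFFF), 8),
        "UCS-2: " + fmt(fixed_width_unit(scalar, 0xFFFF), 4),
    ])


if __name__ == "__main__":
    for value in (0x41, 0xD800, 0xFFFF, 0x10000, 0x10FFFF):
        print(table_row(value))
```

The rows for 0xD800 and 0xFFFF are the ones I am unsure about; the sketch
simply shows what falls out of the arithmetic.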

Here is what I am thinking:

The mapping of a repertoire of abstract characters to a set of non-negative,
not necessarily contiguous integers is a coded character set. The set of
integers used in a coded character set comprises a code space, or a portion
of a code space if I read correctly, because some of the integers in a code
space might not be assigned to abstract characters. A literal interpretation
would seem to say that a coded character set, by definition, cannot contain
integers that do not map to abstract characters.

So for example I could have in my code space all the integers from 0 through
9. I can say that 1 through 9 are assigned to abstract characters, while 0
is reserved for some special purpose. The coded character set would only
consist of the mapping of the abstract characters to integers 1 through 9.
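
The toy example can be written down as a small sketch. The character names
("CHAR 1" and so on) are placeholders invented for illustration; only the
numbers come from the example above.

```python
# Code space: every integer that could be used, 0 through 9.
CODE_SPACE = range(0, 10)

# Coded character set, read literally: only the assigned mappings.
# The abstract character names are hypothetical placeholders.
CODED_CHARACTER_SET = {f"CHAR {n}": n for n in range(1, 10)}

# Integers actually used by the coded character set.
used_integers = set(CODED_CHARACTER_SET.values())

# Integers in the code space that no abstract character maps to.
unassigned = set(CODE_SPACE) - used_integers

if __name__ == "__main__":
    print("used:", sorted(used_integers))
    print("unassigned:", sorted(unassigned))
```

Under the literal reading, 0 is in the code space but falls outside the coded
character set.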

A character encoding form, according to UTR #17, is "a mapping from the set
of integers used in a coded character set to the set of sequences of code
units". Continuing the example, I am left with the integers 1 through 9
mapping to, say, the 8-bit-wide code units 0xF1 through 0xF9. The integer 0
cannot be mapped to a code unit by this encoding form because it is not in
the set of integers used in the coded character set.
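
As a sketch of that literal reading, an encoding form whose domain is only
the integers 1 through 9 has nothing to say about 0. The names ENCODING_FORM
and encode are made up for illustration.

```python
# Integers used by the toy coded character set (0 is reserved, so excluded).
USED_INTEGERS = range(1, 10)

# Toy encoding form: each used integer maps to one 8-bit code unit 0xF1..0xF9.
ENCODING_FORM = {n: bytes([0xF0 + n]) for n in USED_INTEGERS}


def encode(n):
    """Map an integer to its code unit sequence, if the encoding form covers it."""
    try:
        return ENCODING_FORM[n]
    except KeyError:
        raise ValueError(f"{n} is not in the domain of the encoding form") from None


if __name__ == "__main__":
    print(encode(1).hex(), encode(9).hex())
    try:
        encode(0)
    except ValueError as exc:
        print("cannot encode:", exc)
```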

So it seems that a character encoding form, by definition, cannot map an
integer to a code unit sequence if the integer does not map back to an
abstract character. If this is correct, it would seem to have ramifications
for the definition of Unicode values, and hence, Unicode scalar values as
well. The Unicode values U+D800..U+DFFF, U+FFFE and U+FFFF wouldn't exist.

I know this isn't really true, but I'm having trouble reconciling these
definitions for the purposes of developing my table. Do the Unicode scalar
values U-0000D800..U-0000DFFF and U-0000FFFE..U-0000FFFF exist? I can see
how they wouldn't have corresponding code value sequences in UTF-16, which
is fine, but what about in UCS-2, UCS-4, and UTF-8?
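
One way to probe the question empirically is to ask an existing codec
implementation. The sketch below uses Python 3's built-in codecs (3.4 or
later); the helper name try_encode is made up. It only shows how one
present-day implementation behaves, which is not necessarily what the
definitions in UTR #17 or ISO 10646 require.

```python
def try_encode(code_point, codec):
    """Return the encoded bytes, or None if the codec refuses the code point."""
    try:
        return chr(code_point).encode(codec)
    except UnicodeEncodeError:
        return None


if __name__ == "__main__":
    # Big-endian codec names are used so that no byte order mark is prepended.
    for code_point in (0xD800, 0xDFFF, 0xFFFE, 0xFFFF):
        for codec in ("utf-8", "utf-16-be", "utf-32-be"):
            result = try_encode(code_point, codec)
            shown = result.hex() if result is not None else "refused"
            print(f"U+{code_point:04X} {codec}: {shown}")
```

In this implementation the surrogate values are refused by all three codecs,
while U+FFFE and U+FFFF are encoded like any other value.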

Here is what I want someone to tell me:

1. The set of integers in a coded character set can include integers that
are not assigned to abstract characters.

2. Code unit sequences defined by a character encoding form can map to
integers that are part of a coded character set but that have not been
assigned to abstract characters.

   - Mike
___________________________________________________________
Mike J. Brown, software engineer, Webb Interactive Services
XML/XSL stuff: http://www.skew.org/ http://www.webb.net/


