Re: Unicode and ISO terminology

From: Kenneth Whistler (kenw@sybase.com)
Date: Mon Mar 05 2001 - 18:48:46 EST


Mike Sykes asked:

> Can anyone tell me whether there is any prospect of terminology being
> harmonised or reconciled between Unicode and ISO 10646?

Gradually--over the long run. The Unicode Glossary has already added some
terminology from 10646, to make the usage of concepts like "planes"
clear. And the two committees deliberately worked to converge on
"supplementary characters" and "supplementary planes" for referring
to characters > U+FFFF, so as to avoid another layer of confusion for
10646-2.

However, some of the terminology in the Unicode Standard was *deliberately*
chosen to be distinct from 10646 years ago, and we live with the
consequences.

> A joint glossary
> would be useful.

An editor who volunteers to produce the joint glossary would also be
useful.

>
> In particular, a careful (for me ;-?) comparison of the definitions in
> Unicode 3 and those in 10646 leaves me wondering, for example, to what
> Unicode term the ISO term "coded character" corresponds. The definition in
> JTC1/SC2/WG2 N 2005 dated 1999-05-29 says:
>
> 4.8 coded character: A character together with its coded representation.
>
> Although "coded representation" doesn't have its own definition, it seems
> clear from its occurrence in the definition of "character boundary" that the
> intended meaning is a sequence of octets, rather than an integer (or
> perhaps, given the 10646 context, a "code position"), as opposed to a code
> unit.

I don't think that interpretation is a very meaningful one. Remember that
10646 was essentially SC2's first venture into multi-octet character
encoding. The model wasn't thought all the way through, nor was it
reflected back onto the definitions.

The definition from time immemorial for 8-bit character sets in SC2 is
still present in the 8859 series:

4.6 Coded character set; code: A set of unambiguous rules that establishes
a character set and the one-to-one relationship between the characters
of the set and their bit combinations.

In 10646's definition, the "code" was dropped from the definiendum, and
in the definiens, "bit combinations" was changed to "coded representation".
Why, I'm not sure, except possibly to make it easier to introduce a
definition of "coded character" -- which to some might have seemed an
obvious missing piece if a "coded character set" is conceived of as
a set of "coded characters".

>
> Thus, the definitions in 10646 don't seem to reflect the two (or more) stage
> encoding process described in UTR#17.

I agree. I don't think 10646 has/had a thought-through encoding model, any
more than the Unicode Standard used to. Both started groping in the dark
when it came to trying to explain "transformation formats".

>
> Please don't trouble to tell me the history: what I don't know by now I'm
> happy to remain ignorant of.

You'll get history nonetheless, since it is impossible to understand
the current text without understanding where it came from.

These texts were not created by axiomatic mathematical logicians, but
by standards editors trying to make use of prior text where they could
and often editing text in committees to reach compromises that might
not make sense later, when the implications of further additions come
clear.

> My problem is that SQL has used a number of terms (such as character,
> character set, repertoire) for a number of years, and I'd like to relate
> them to well-defined terms from relevant sources, so as to be able to make
> statements of the form:
>
> When SQL says "character" it's using the 10646 definition, which is what
> Unicode calls an "abstract character".

Which it should. (Except that the Unicode Standard has *two* concepts
of "abstract character", just to make things more complicated.)

>
> or, alternatively:
>
> When SQL says "character" it means what 10646 defines as a "coded
> character", which is what Unicode calls a "coded character representation"
> (Def D6, briefly, a sequence of code units),

If the SQL standard and other similar standards want to be au courant
regarding Unicode implementations, they need to fit against the following
model:

Standard  Encoded Character    Encoding Form        Encoding Scheme

8859-1    10/12 NOT SIGN       0xAC                 0xAC

Unicode   U+00AC NOT SIGN      0x000000AC           0x00 0x00 0x00 0xAC
                                                    0xAC 0x00 0x00 0x00
                               0x00AC               0x00 0xAC
                                                    0xAC 0x00
                               0xC2 0xAC            0xC2 0xAC

          U+10300              0x00010300           0x00 0x01 0x03 0x00
          OLD ITALIC LETTER A                       0x00 0x03 0x01 0x00
                               0xD800 0xDF00        0xD8 0x00 0xDF 0x00
                                                    0x00 0xD8 0x00 0xDF
                               0xF0 0x90 0x8C 0x80  0xF0 0x90 0x8C 0x80

Where the Encoding Schemes are sequences of *bytes*.
Where the Encoding Forms are sequences of *code units*.
Where the Encoded Character is a mapping of an abstract character against
   an (integral) code point in the code space.
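
A minimal sketch of the three levels, assuming Python's built-in codecs as
stand-ins for the encoding schemes. The helper name encoding_levels is made
up for illustration and is not part of any standard or library:

```python
# Hypothetical helper: report code point, code units, and bytes for one
# character, using Python's built-in codecs.
def encoding_levels(ch):
    code_point = ord(ch)  # encoded character: an integer in the code space

    # Encoding forms: sequences of code units (32-bit, 16-bit, 8-bit).
    utf16_be = ch.encode("utf-16-be")
    forms = {
        "UTF-32": [code_point],
        "UTF-16": [int.from_bytes(utf16_be[i:i + 2], "big")
                   for i in range(0, len(utf16_be), 2)],
        "UTF-8": list(ch.encode("utf-8")),
    }

    # Encoding schemes: sequences of bytes, with the byte order made explicit.
    schemes = {
        name: ch.encode(name)
        for name in ("utf-32-be", "utf-32-le", "utf-16-be", "utf-16-le", "utf-8")
    }
    return code_point, forms, schemes


for ch in ("\u00ac", "\U00010300"):  # NOT SIGN, OLD ITALIC LETTER A
    cp, forms, schemes = encoding_levels(ch)
    print(f"U+{cp:04X}")
    for name, units in forms.items():
        print("  form  ", name, " ".join(f"0x{u:X}" for u in units))
    for name, data in schemes.items():
        print("  scheme", name, " ".join(f"0x{b:02X}" for b in data))
```

For UTF-8 the code units and the bytes coincide, which is why only the
16-bit and 32-bit forms split into big-endian and little-endian schemes.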

Yes, with single-byte character encodings, it used to be so simple, and
one could ignore the difference between encoding forms and encoding schemes.
But in the Unicode Standard, there are now 3 encoding forms and 5 encoding
schemes (if you don't also count the variations involving BOM).
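
The BOM variations can be illustrated the same way. This is only a sketch
using Python's codecs, where the generic "utf-16" decoder consumes a leading
byte order mark while "utf-16-be" and "utf-16-le" fix the byte order by name:

```python
import codecs

text = "\u00ac"  # NOT SIGN

be = text.encode("utf-16-be")  # bytes 00 AC, no BOM
le = text.encode("utf-16-le")  # bytes AC 00, no BOM

# With a BOM in front, the generic decoder works out the byte order itself.
assert (codecs.BOM_UTF16_BE + be).decode("utf-16") == text
assert (codecs.BOM_UTF16_LE + le).decode("utf-16") == text

# Without a BOM the byte order has to be known from elsewhere; reading
# big-endian bytes as little-endian yields a different character entirely.
assert be.decode("utf-16-le") == "\uac00"
```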

What other standards adapting to the Unicode Standard should not do is
seek terminological shelter from the real complexity that the bottom
half of this table demonstrates. All three columns and all of the rows
may be relevant to the correct use of Unicode in implementations
and standards.
                                        
Roughly, the three columns correspond, in implementations, to character
identity, API binding, and byte streams. A standard needs to determine
which usage it is referring to.
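
One place where the choice shows up is in what "length" means. A small
sketch in Python, assuming a 16-bit API binding for the middle value (which
of the three a given standard intends is exactly what it has to state):

```python
text = "\u00ac\U00010300"  # NOT SIGN followed by OLD ITALIC LETTER A

code_points = len(text)                           # character identity
utf16_units = len(text.encode("utf-16-be")) // 2  # 16-bit code units
utf8_bytes = len(text.encode("utf-8"))            # bytes in a UTF-8 stream

print(code_points, utf16_units, utf8_bytes)  # prints: 2 3 6
```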

> and not what it calls an
> "encoded character" (see para following definition D4, though I notice
> "coded character" [sic] occurs twice in Unicode 3.1).
>
> Now, I knew there were problems around here, even before the recent
> discussion on the List. And I know SQL is going to have to live with them.
> I'd just like to be confident that we're doing the best we can.

And yes, it would help matters a lot if the Unicode Standard were completely
rewritten to make the character encoding model cleaner and the application
of terminology less confusing. That is one of the major tasks slated
for Unicode 4.0.

--Ken


