Unicode and ISO terminology

From: J M Sykes (mike.sykes@acm.org)
Date: Sat Mar 03 2001 - 13:24:03 EST


Can anyone tell me whether there is any prospect of terminology being
harmonised or reconciled between Unicode and ISO 10646? A joint glossary
would be useful.

Obvious synonyms (e.g. byte vs. octet) don't bother me, but differences
between the structures apparently defined by the two sets of terminology do.

In particular, a careful (for me ;-?) comparison of the definitions in
Unicode 3 and those in 10646 leaves me wondering, for example, to what
Unicode term the ISO term "coded character" corresponds. The definition in
JTC1/SC2/WG2 N 2005 dated 1999-05-29 says:

4.8 coded character: A character together with its coded representation.

Although "coded representation" doesn't have its own definition, it seems
clear from its occurrence in the definition of "character boundary" that the
intended meaning is a sequence of octets, and neither an integer (or
perhaps, given the 10646 context, a "code position") nor a code unit.
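
A minimal Python sketch of the distinction, for illustration only: the
character chosen is arbitrary, the variable names are hypothetical, and
nothing here is taken from the text of either standard. It shows one
character as an integer and as two different sequences of octets.

```python
# One abstract character: LATIN SMALL LETTER E WITH ACUTE (arbitrary choice).
ch = "\u00e9"

# The integer associated with the character (a "code position" in 10646 terms).
code_position = ord(ch)

# Two different octet sequences for that same integer. Which of these,
# if either, counts as the "coded representation" depends on the encoding form.
octets_utf8 = ch.encode("utf-8")
octets_utf16be = ch.encode("utf-16-be")

print(hex(code_position), octets_utf8.hex(), octets_utf16be.hex())
```

The integer is the same in both cases; only the octet sequences differ,
which is why it matters which of the two a definition is talking about.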

Thus, the definitions in 10646 don't seem to reflect the two (or more) stage
encoding process described in UTR#17. Even PDUTR#27 (Unicode 3.1) doesn't
correct the bullet point following definition D6 in Unicode 3.0. Nor does it
make clear that surrogates are relevant _only_ in the context of UTF-16, a
point that could easily be missed when reading Unicode 3.0 section 3.7
"Surrogates".
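
The point about surrogates can be sketched in Python, under stated
assumptions: the helper names utf16_code_units and utf32_code_units are
hypothetical, U+10400 is just a convenient supplementary character, and
Python's built-in codecs stand in for the encoding forms. This is an
illustration, not a rendering of the staged model in UTR#17.

```python
import struct

def utf16_code_units(text):
    """Return the 16-bit code units of text in UTF-16 (hypothetical helper)."""
    data = text.encode("utf-16-be")
    return list(struct.unpack(">%dH" % (len(data) // 2), data))

def utf32_code_units(text):
    """Return the 32-bit code units of text in UTF-32 (hypothetical helper)."""
    data = text.encode("utf-32-be")
    return list(struct.unpack(">%dI" % (len(data) // 4), data))

# A character outside the Basic Multilingual Plane (arbitrary choice).
ch = "\U00010400"

units16 = utf16_code_units(ch)   # two code units: a surrogate pair
units32 = utf32_code_units(ch)   # one code unit: the integer itself
octets8 = ch.encode("utf-8")     # four octets, none of them a surrogate

def is_surrogate(unit):
    # Surrogate code units occupy the range D800..DFFF.
    return 0xD800 <= unit <= 0xDFFF

print([hex(u) for u in units16], [hex(u) for u in units32], octets8.hex())
```

Only the UTF-16 form contains values in the surrogate range; the UTF-32
and UTF-8 forms of the same character carry no trace of them.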

Please don't trouble to tell me the history: what I don't know by now I'm
happy to remain ignorant of.

And yes, I have read "Character set considered harmful"
<http://www.w3.org/MarkUp/html-spec/charset-harmful>, and whole-heartedly
agree with it. But we can't change the past, however much we might like to.

My problem is that SQL has used a number of terms (such as character,
character set, repertoire) for a number of years, and I'd like to relate
them to well-defined terms from relevant sources, so as to be able to make
statements of the form:

When SQL says "character" it's using the 10646 definition, which is what
Unicode calls an "abstract character".

or, alternatively:

When SQL says "character" it means what 10646 defines as a "coded
character", which is what Unicode calls a "coded character representation"
(Def D6, briefly, a sequence of code units), and not what it calls an
"encoded character" (see para following definition D4, though I notice
"coded character" [sic] occurs twice in Unicode 3.1).

Now, I knew there were problems around here, even before the recent
discussion on the List. And I know SQL is going to have to live with them.
I'd just like to be confident that we're doing the best we can.

Mike.

***********************************************************

J M Sykes Email: Mike.Sykes@acm.org
97 Oakdale Drive
Heald Green
CHEADLE
Cheshire SK8 3SN
UK Tel: (44) 161 437 5413

***********************************************************


