Re: Unicode Myths

From: Kenneth Whistler (kenw@sybase.com)
Date: Tue Apr 09 2002 - 19:33:24 EDT


David Hopwood provided various comments on Mark Davis' Unicode
Myths slides. I'm sure Mark will respond in some way, but I
have some counter-comments on the part having a bearing on
the Unicode character encoding model.

> Slide 5

I'm not sure what you are on about here. Mark's Myth #5 was stated as "Every
Unicode code point represents a character", and the slide bullets are
just talking points towards explanation of the different major categories
for the code points; some of them encode characters, and some of them do
not.

Granted that the text of Unicode 3.0 is murky about all this. That is
all being worked on for Unicode 4.0, to bring the definitions in line
with the general framework of the Character Encoding Model. The revision
of Chapter 3, in particular, will not be available for UTC review for
a while yet, but much of this is guaranteed to be more clearly stated
in the next edition.

The short synopsis of the "Standard Model" is:

abstract character

   Those entities which are to be encoded. They can be, in principle,
   anything, from letters of alphabets, to undisplayed format or other
   control functions, to roundtrip conversion clones. They are "what
   gets encoded".

repertoire

   A set of abstract characters.

codespace

   A range of nonnegative integers used for encoding. (For Unicode,
   this range is 0..0x10FFFF, inclusive.)

code point

   A value within the codespace.

encoding

   1. The process of associating ("mapping") abstract characters with
      code points.
   2. The result of associating a particular repertoire with code
      points. (aka coded character set, "CCS")

encoded character

   An abstract character together with the code point value it has
   been mapped to.

code unit

   A numerical unit associated with a fixed-width data type (generally,
   8-bit, 16-bit, or 32-bit, because of computer architecture
   considerations), used in character encoding forms.

character encoding form

   A mapping from the set of integers used in a CCS to a set of sequences
   of code units. (Unicode has 3 encoding forms: UTF-8, UTF-16, UTF-32.)
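
   To make the code point / code unit distinction concrete, here is a
   small Python sketch (my illustration, not text from the standard).
   The helper name code_units is hypothetical; it relies only on
   Python's built-in codecs, and it shows one code point turning into
   three different sequences of code units.

```python
# Sketch: one code point, three encoding forms, three code unit sequences.
# The function name and structure are illustrative assumptions.

def code_units(ch, form):
    """Return the list of code units for one character in an encoding form."""
    width = {"utf-8": 1, "utf-16": 2, "utf-32": 4}[form]  # bytes per code unit
    # For the 16- and 32-bit forms, use the big-endian codec so that no
    # byte order mark is emitted; UTF-8 has no byte order to choose.
    codec = form if width == 1 else form + "-be"
    data = ch.encode(codec)
    return [int.from_bytes(data[i:i + width], "big")
            for i in range(0, len(data), width)]

ch = "\U00010400"  # code point U+10400, outside the BMP
for form in ("utf-8", "utf-16", "utf-32"):
    print(form, [hex(u) for u in code_units(ch, form)])
# utf-8  -> four 8-bit code units:  0xf0 0x90 0x90 0x80
# utf-16 -> two 16-bit code units:  0xd801 0xdc00
# utf-32 -> one 32-bit code unit:   0x10400
```

   Note that the two UTF-16 code units fall in 0xD800..0xDFFF, which is
   why that range matters for the next definition.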

Unicode scalar value

   Because of the nature of the definition of UTF-16, not all code points
   in the Unicode codespace can be represented in the Unicode character
   encoding forms. And because of that, a concept called the Unicode
   scalar value is used; that refers to the subset of integers used in
   the Unicode CCS, namely 0..0xD7FF, 0xE000..0x10FFFF. The Unicode scalar
   values are the subset of integers in the Unicode codespace that
   constitute the domain for definition of the 3 Unicode encoding forms.
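
   The difference between "code point" and "Unicode scalar value" can
   be sketched in a few lines of Python. The function names below are
   my own, chosen for illustration; the only thing taken from the
   definitions above is the pair of ranges.

```python
# Sketch: code points versus Unicode scalar values.
# Names are illustrative assumptions, not terms defined by the standard.

CODESPACE_MAX = 0x10FFFF
SURROGATE_FIRST, SURROGATE_LAST = 0xD800, 0xDFFF

def is_code_point(n):
    """True for any integer in the Unicode codespace."""
    return 0 <= n <= CODESPACE_MAX

def is_scalar_value(n):
    """True for code points outside the surrogate range."""
    return is_code_point(n) and not (SURROGATE_FIRST <= n <= SURROGATE_LAST)

def encodable(n):
    """True if Python's UTF-8, UTF-16 and UTF-32 codecs all accept the code point."""
    try:
        for codec in ("utf-8", "utf-16-be", "utf-32-be"):
            chr(n).encode(codec)
        return True
    except UnicodeEncodeError:
        return False

code_points = CODESPACE_MAX + 1
scalar_values = code_points - (SURROGATE_LAST - SURROGATE_FIRST + 1)
print(code_points, scalar_values)   # 1114112 1112064

print(is_code_point(0xD800), is_scalar_value(0xD800), encodable(0xD800))
# True False False: a code point, but outside the domain of the encoding forms
```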

Code point categorization

   To make sense of the categorization of code points, I make use of
   three concepts: assignment, allocation, and designation.

   Assignment refers to the status of a code point as having an
   abstract character associated with it by the standard.

   Ordinary encoded characters (code points which have been mapped
   to abstract characters), control characters (code points which
   have been mapped to abstract characters which in turn are placeholders
   for control functions specified by other standards), and private
   use code points are all *assigned* code points. All *assigned*
   code points may have character properties associated with them,
   since they are associated with abstract characters.

   Note that the PUA characters are very funny animals, in that their
   entire meaning is undefined and they have no names, but they have
   to be treated as assigned characters to make sense. Effectively,
   we have

     <<Abstract character for private use 1>> assigned to U+E000
     <<Abstract character for private use 2>> assigned to U+E001
     ...
     <<Abstract character for private use 137468>> assigned to U+10FFFD
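
   The numbering in that list can be checked with a little arithmetic
   over the three private-use ranges (the BMP private use area plus
   planes 15 and 16, less their final two noncharacter code points).
   This is only a sketch; the helper private_use_ordinal is a
   hypothetical name of mine.

```python
# Sketch: count the private-use code points and number them 1..N
# in code point order. Names are illustrative assumptions.

PRIVATE_USE_RANGES = [
    (0xE000, 0xF8FF),      # BMP private use area
    (0xF0000, 0xFFFFD),    # plane 15, minus its last two code points
    (0x100000, 0x10FFFD),  # plane 16, minus its last two code points
]

def private_use_ordinal(cp):
    """Return the 1-based position of cp among all private-use code points."""
    seen = 0
    for first, last in PRIVATE_USE_RANGES:
        if first <= cp <= last:
            return seen + (cp - first) + 1
        seen += last - first + 1
    raise ValueError("U+%04X is not a private-use code point" % cp)

total = sum(last - first + 1 for first, last in PRIVATE_USE_RANGES)
print(total)                          # 137468
print(private_use_ordinal(0xE000))    # 1
print(private_use_ordinal(0x10FFFD))  # 137468
```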

   Noncharacter code points, surrogate code points, and reserved
   (no mapping to an abstract character) code points are all *un*assigned.

   Allocation refers to the categorization of code points into types
   and subtypes for assignment.

   Allocation can span assigned and unassigned
   code points. Character blocks are pre-allocated spans of code points,
   conceptually associated with particular groups of characters.
   The Devanagari block is *allocated* to Devanagari. That means that
   the character encoding committees will only assign Devanagari script
   characters to the unassigned code points within the allocated block.
   (Not all blocks are so clear about their allocation semantics, but
   the general concept should be clear.)

   Designation refers to the formal specification of usage for a
   code point by the standard. All assigned code points, as well as
   noncharacter code points and surrogate code points, have their
   usage formally and normatively specified by the standard.
   Reserved code points, on the other hand, are *un*designated as to
   usage. They are simply reserved for future designation, and in
   principle could become any of the other designated types or even
   a new designated type that does not currently exist.
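
   For those who think better in code, the assignment and designation
   statuses above can be sketched as a small classifier. This is my own
   rough illustration: the labels and function names are hypothetical,
   the noncharacter test assumes the set U+FDD0..U+FDEF plus the last
   two code points of each plane, and the "reserved" answer depends on
   which version of the Unicode data ships with your Python.

```python
# Sketch: classify a code point as (kind, assigned, designated).
# Labels and names are illustrative assumptions, not normative terms.
import unicodedata

def is_noncharacter(cp):
    # U+FDD0..U+FDEF, plus U+nFFFE and U+nFFFF at the end of every plane.
    return 0xFDD0 <= cp <= 0xFDEF or (cp & 0xFFFE) == 0xFFFE

def categorize(cp):
    if not 0 <= cp <= 0x10FFFF:
        raise ValueError("outside the Unicode codespace")
    if 0xD800 <= cp <= 0xDFFF:
        return ("surrogate", False, True)      # unassigned, but designated
    if is_noncharacter(cp):
        return ("noncharacter", False, True)   # unassigned, but designated
    gc = unicodedata.category(chr(cp))         # General_Category from Python's data
    if gc == "Cn":
        return ("reserved", False, False)      # unassigned and undesignated
    if gc == "Co":
        return ("private use", True, True)
    if gc == "Cc":
        return ("control", True, True)
    return ("ordinary", True, True)

for cp in (0x0041, 0x000A, 0xE000, 0xD800, 0xFFFF, 0x70000):
    print("U+%04X" % cp, categorize(cp))
```

   The only row of that table that changes from one version of the
   standard to the next is "reserved": a reserved code point can later
   move into one of the designated kinds.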

Now, in the context of that statement of the model, let me consider your
claims:

> - Unassigned characters are characters (clause C6 in Chapter 3 of the
> Standard notwithstanding).

This claim appears to make no sense -- but that is the result of
your different use of the term "character".

What I can stipulate is:

   Unassigned abstract characters are abstract characters.

   That is, we don't have to encode an entity to have given it a
   status of "thing to be encoded" as a character. It can be
   an acknowledged member of a repertoire before it is encoded.

   Unassigned code points are code points.

   This simply means that a code point does not have an abstract
   character mapped to it; it is surely a code point nonetheless.

What I could not stipulate would be:

   Unassigned code points are characters.

   Unassigned code points are neither abstract characters per se,
   nor do they have abstract characters mapped to them.

> Search the standard for "unassigned character"; it occurs several
> times.

This is mostly a careless usage in the earlier text, and in nearly
all cases will be replaced by "unassigned code point" in future versions.

> Also, several clauses and definitions are incorrect or
> incomplete if unassigned code points do not correspond to characters:
> at least C9, C10, C13, D1, D6, D7, D9 (which should not restrict to
> "graphic characters"), D11, D13, D14, D17..D24, D28, and the note
> after D29.

There are various infelicities in some of those clauses and definitions,
some of which have been addressed in Unicode 3.1 and Unicode 3.2, and
more of which will be clarified in Unicode 4.0. However, I disagree with
your main point, since, in principle, unassigned code points *cannot*
"correspond to characters". That would contradict the concept of
assignment.

> - Format control characters are also characters.

Assuredly, yes. And I don't think Mark was claiming otherwise.

> - Private-use characters are definitely characters.

Also, yes. See above.

> - The values D800..DFFF are not valid code point values,

Incorrect. See above for the distinction between code point values
and Unicode scalar values.

> they are UTF-16
> code unit values (the valid Unicode code point space is 0..D7FF union
> E000..10FFFF.)

> In computer jargon, "characters" are, by definition, the things that are
> enumerated in coded-character-sets (regardless of whether or not they are
> displayed as simple spacing glyphs, have control functions, are not yet
> assigned, or have any other strange properties).

I would agree with this. This is what the Unicode Standard means by
"abstract character".

> Apart from the unfortunate
> "noncharacter" terminology (which would have better called "internal-use
> characters"),

No -- internal use *code points*.

> all valid Unicode code points *do* correspond to characters
> in this sense.

Incorrect. What I think you are trying to say, translated into my
terminology, is that all Unicode code points aside from surrogate
code points (your "invalid") and noncharacter code points correspond
to abstract characters. This is true for *assigned* code points,
of course, since that is what I *mean* by assigned. But it is not
true for the reserved code points, which are unassigned -- and which
thereby cannot be considered to be (associated with encoded) characters.

Using the term "character" for an unassigned, reserved code point
just blurs the terminological distinction between character and
code point unacceptably. U+70000 is assuredly a valid Unicode code point,
but it is not a *character* until and unless the UTC and WG2
assign something to it.

> Note that there is no conflict between this jargon meaning of "character",
> and its original meaning as a unit of text.

Well, no conflict if you mean that they are different usages, applicable
to different domains of consideration.

But they are certainly con*fus*ing and are commonly confused by people
who do not understand how character encoding standards work.

> While we're on this subject, it's also redundant to say "abstract character":

Nope. It is a deliberate usage to distinguish between
"character" as entity to be encoded and "character" as encoded entity.

> *all* characters are abstractions,

Of course. All the better then to identify them as abstract characters. ;-)

> and the definition of this term (D3 in
> Chapter 3 of the Unicode Standard) doesn't mean anything different to plain
> "character", as defined above.

Nope. Abstract character is a deliberately constrained term. "Character"
has multiple, and occasionally ambiguous, usages in the text of the
Unicode Standard and in general discussion about character encoding,
even by the experts.

> Slide 29
> - there are 1,112,064 valid Unicode code points, not 1,114,112.
> (D800..DFFF are not valid code points.)

Nope.

Unicode has 1,114,112 code points.

There are 1,112,064 Unicode scalar values.

--Ken


