Re: Surrogate points

From: Doug Ewell (dewell@adelphia.net)
Date: Sun Jan 30 2005 - 16:18:53 CST

    Hans Aberg <haberg at math dot su dot se> wrote:

    > The numbers 0xD800-0xDFFF, 0xFFFE-0xFFFF are not associated with
    > character, but included as place holders, never to be used, because
    > one has failed to give the encoding UTF-16 a proper design. So an
    > unrelated problem, choice of character encoding, is allowed to
    > influence the logical core, the character set description.

    Hans, these statements are both factually inaccurate and misguided, as
    -- to be frank -- has been practically every statement you have made
    about Unicode in the two weeks since you started posting.

    1. Surrogate code points 0xD800 through 0xDFFF

    Until 1996, when Unicode 2.0 was released, the range from 0xD800 to
    0xDFFF was part of a much larger range of unallocated code points,
    called the "O-zone" in ISO/IEC 10646 terminology. That zone stretched
    all the way from 0xA000 to 0xDFFF and, at the time, was not yet used
    for precomposed Hangul syllables, Yi, or anything else.

    The code points starting with 0xE000, on the other hand, belonged to a
    "restricted" or "R-zone" which included allocations for compatibility
    characters, as well as the Private Use Area (subsequently moved and
    expanded) and the noncharacters 0xFFFE and 0xFFFF (more on these below).

    When the decision was made to extend Unicode beyond 65,536 code points,
    using a surrogate-pair mechanism called "extended UCS-2" in 1993 and
    now known as "UTF-16," a block of 2,048 unused code points was needed
    for the surrogates (1,024 high and 1,024 low, yielding 1,024 x 1,024 =
    1,048,576 supplementary code points), and the largest such block
    available was in the O-zone. The range 0xD800 through 0xDFFF, at the
    top of the O-zone, *may* have been chosen to minimize sorting
    difficulties (since the R-zone characters were intended for
    compatibility only and not expected to be common in Unicode text), but
    I may be wrong on this.

    In any case, it is incorrect to state that the choice of this block was
    due to a "failure to give UTF-16 a proper design." Other blocks, such
    as the "obvious" 0xF800 through 0xFFFF, were already occupied.

    2. Noncharacters 0xFFFE and 0xFFFF

    The designation of 0xFFFE and 0xFFFF as "noncharacters" goes back to
    Unicode 1.0 (1991), although that term was not used at the time. The
    numeric value -1 has a long history of being used as a "sentinel" value,
    to indicate the end of a series of real values. This works fine for
    non-negative numeric data, such as inventory counts, but it caused
    problems in existing 8-bit character sets, where the value 0xFF (the
    8-bit representation of -1) might have a real meaning.

    To solve this problem, Unicode 1.0 set aside the value 0xFFFF as NOT
    corresponding to an actual character. This way, programs that used
    16-bit values (i.e. all Unicode programs at the time) could safely use
    it as a sentinel without fear of colliding with a real character
    assignment. This was completely intentional.
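
    As a rough illustration of the sentinel idiom (my own sketch; the
    buffer contents are invented):

        #include <stdio.h>
        #include <stdint.h>

        #define SENTINEL 0xFFFFu /* noncharacter: never a real character */

        int main(void)
        {
            /* A 16-bit buffer terminated by the noncharacter sentinel. */
            uint16_t buf[] = { 0x0048, 0x0069, 0x0021, SENTINEL };

            size_t n = 0;
            while (buf[n] != SENTINEL) /* cannot collide with real data */
                n++;
            printf("%zu code units before the sentinel\n", n); /* 3 */
            return 0;
        }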

    The values 0xFEFF and 0xFFFE were also chosen quite intentionally, as a
    byte-order mark and its byte-swapped version respectively. Studies had
    shown that the byte sequences <FF, FE> and <FE, FF> were very rare at
    the start of a plain-text stream in any existing character encoding. So
    the BOM served two useful purposes from the outset: to identify the
    text stream as Unicode and to indicate the intended byte order. The
    overloading of U+FEFF as a zero-width no-break space was an effect of
    the merger with ISO/IEC 10646, and also not related to UTF-16.
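
    A minimal sketch of that byte-order check (my own illustration; the
    function name is invented):

        #include <stdio.h>

        /* Inspect the first two bytes of a stream for a UTF-16 BOM.
           Returns 1 for big-endian <FE, FF>, 0 for little-endian
           <FF, FE>, and -1 if neither BOM is present. */
        int check_bom(const unsigned char *p)
        {
            if (p[0] == 0xFE && p[1] == 0xFF) return 1;  /* U+FEFF */
            if (p[0] == 0xFF && p[1] == 0xFE) return 0;  /* swapped */
            return -1;
        }

        int main(void)
        {
            unsigned char be[] = { 0xFE, 0xFF, 0x00, 0x41 }; /* "A" BE */
            unsigned char le[] = { 0xFF, 0xFE, 0x41, 0x00 }; /* "A" LE */
            printf("%d %d\n", check_bom(be), check_bom(le)); /* 1 0 */
            return 0;
        }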

    Claiming that either of these features of Unicode is the result of poor
    design of UTF-16 is simply wrong. It is an uninformed opinion based on
    inadequate consideration of the facts.

    Hans, I don't know how long you spent on this list as a silent observer
    ("lurker") before you began posting, but evidently not long enough.

    When I joined this list, I spent almost a year lurking before I made my
    first post. I listened to the experts. I made plenty of wrong
    statements of my own, but accepted the criticisms and corrections of
    those who obviously knew more than I did. I learned the history of why
    things are the way they are, and perhaps most importantly, I learned
    the importance of
    Unicode's stability policies, which explain why it is TOO LATE to make
    major architectural changes that would invalidate all existing
    implementations.

    While I admit a year may be excessive, I strongly suggest you take some
    time off to READ the list, read the FAQs, read the book (on-line or
    hardcover), read the UAXs, UTSs, and UTRs, and THINK about why the
    Unicode Standard is the way it is, and what can -- and cannot -- be done
    to change it. The choice is entirely up to you, but if you do not do
    the necessary homework to draw reasonable conclusions and ask reasonable
    questions, your posts will continue to reflect your lack of
    understanding, and will be ignored by more and more people.

    This is all I have to say on this topic, and I will not engage in a
    flame war over it.

    -Doug Ewell
     Fullerton, California
     http://users.adelphia.net/~dewell/


