Re: Surrogate points

From: Doug Ewell (dewell@adelphia.net)
Date: Sun Jan 30 2005 - 16:18:53 CST

    Hans Aberg <haberg at math dot su dot se> wrote:

    > The numbers 0xD800-0xDFFF, 0xFFFE-0xFFFF are not associated with
    > character, but included as place holders, never to be used, because
    > one has failed to give the encoding UTF-16 a proper design. So an
    > unrelated problem, choice of character encoding, is allowed to
    > influence the logical core, the character set description.

    Hans, these statements are both factually inaccurate and misguided, as
    -- to be frank -- has been practically every statement you have made
    about Unicode in the two weeks since you started posting.

    1. Surrogate code points 0xD800 through 0xDFFF

    Until 1996, when Unicode 2.0 was released, the range from 0xD800 to
    0xDFFF was part of a much larger range of unallocated code points,
    called the "O-zone" in ISO/IEC 10646 terminology. That zone stretched
    all the way from 0xA000 to 0xDFFF and, at the time, was not yet used
    for precomposed Hangul syllables, Yi, or anything else.

    The code points starting with 0xE000, on the other hand, belonged to a
    "restricted" or "R-zone" which included allocations for compatibility
    characters, as well as the Private Use Area (subsequently moved and
    expanded) and the noncharacters 0xFFFE and 0xFFFF (more on these below).

    When the decision was made to extend Unicode beyond 65,536 code points,
    using a surrogate-pair mechanism called "extended UCS-2" in 1993 and
    now known as "UTF-16," a block of 2,048 unused code points was needed
    for the surrogates (1,024 high and 1,024 low, yielding 1,024 x 1,024 =
    1,048,576 supplementary code points), and the largest such block
    available was in the O-zone. The range 0xD800 through 0xDFFF, at the
    top of the O-zone, *may* have been chosen to minimize sorting
    difficulties (since the R-zone characters were intended for
    compatibility only and not expected to be common in Unicode text), but
    I may be wrong on this.

    In any case, it is incorrect to state that the choice of this block was
    due to a "failure to give UTF-16 a proper design." Other blocks, such
    as the "obvious" 0xF800 through 0xFFFF, were already occupied.

    2. Noncharacters 0xFFFE and 0xFFFF

    The designation of 0xFFFE and 0xFFFF as "noncharacters" goes back to
    Unicode 1.0 (1991), although that term was not used at the time. The
    numeric value -1 has a long history of being used as a "sentinel" value,
    to indicate the end of a series of real values. This works fine for
    non-negative numeric data, such as inventory counts, but it caused
    problems in existing 8-bit character sets, where the value 0xFF (the
    8-bit representation of -1) might have a real meaning.

    To solve this problem, Unicode 1.0 set aside the value 0xFFFF as NOT
    corresponding to an actual character. This way, programs that used
    16-bit values (i.e. all Unicode programs at the time) could safely use
    it as a sentinel without fear of colliding with a real character
    assignment. This was completely intentional.
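
    As a rough illustration of the sentinel idiom (my own sketch; the
    buffer contents are invented):

        #include <stdio.h>
        #include <stdint.h>

        #define SENTINEL 0xFFFFu /* noncharacter: never a real character */

        int main(void)
        {
            /* A 16-bit buffer terminated by the noncharacter sentinel. */
            uint16_t buf[] = { 0x0048, 0x0069, 0x0021, SENTINEL };

            size_t n = 0;
            while (buf[n] != SENTINEL) /* cannot collide with real data */
                n++;
            printf("%zu code units before the sentinel\n", n); /* 3 */
            return 0;
        }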

    The values 0xFEFF and 0xFFFE were also chosen quite intentionally, as a
    byte-order mark and its byte-swapped version respectively. Studies had
    shown that the byte sequences <FF, FE> and <FE, FF> were very rare at
    the start of a plain-text stream in any existing character encoding. So
    the BOM served two useful purposes from the outset: to identify the
    text stream as Unicode and to indicate the intended byte order. The
    overloading of U+FEFF as a zero-width no-break space was an effect of
    the merger with ISO/IEC 10646, and also not related to UTF-16.
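
    A minimal sketch of that byte-order check (my own illustration; the
    function name is invented):

        #include <stdio.h>

        /* Inspect the first two bytes of a stream for a UTF-16 BOM.
           Returns 1 for big-endian <FE, FF>, 0 for little-endian
           <FF, FE>, and -1 if neither BOM is present. */
        int check_bom(const unsigned char *p)
        {
            if (p[0] == 0xFE && p[1] == 0xFF) return 1;  /* U+FEFF */
            if (p[0] == 0xFF && p[1] == 0xFE) return 0;  /* swapped */
            return -1;
        }

        int main(void)
        {
            unsigned char be[] = { 0xFE, 0xFF, 0x00, 0x41 }; /* "A" BE */
            unsigned char le[] = { 0xFF, 0xFE, 0x41, 0x00 }; /* "A" LE */
            printf("%d %d\n", check_bom(be), check_bom(le)); /* 1 0 */
            return 0;
        }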

    Claiming that either of these features of Unicode is the result of poor
    design of UTF-16 is simply wrong. It is an uninformed opinion based on
    inadequate consideration of the facts.

    Hans, I don't know how long you spent on this list as a silent observer
    ("lurker") before you began posting, but evidently not long enough.

    When I joined this list, I spent almost a year lurking before I made my
    first post. I listened to the experts. I made plenty of wrong
    statements of my own, but accepted the criticisms and corrections of
    those who obviously knew more than I did. I learned the history of why
    things are the way they are, and perhaps most importantly, I learned
    the importance of
    Unicode's stability policies, which explain why it is TOO LATE to make
    major architectural changes that would invalidate all existing
    implementations.

    While I admit a year may be excessive, I strongly suggest you take some
    time off to READ the list, read the FAQs, read the book (on-line or
    hardcover), read the UAXs, UTSs, and UTRs, and THINK about why the
    Unicode Standard is the way it is, and what can -- and cannot -- be done
    to change it. The choice is entirely up to you, but if you do not do
    the necessary homework to draw reasonable conclusions and ask reasonable
    questions, your posts will continue to reflect your lack of
    understanding, and will be ignored by more and more people.

    This is all I have to say on this topic, and I will not engage in a
    flame war over it.

    -Doug Ewell
     Fullerton, California
     http://users.adelphia.net/~dewell/


