Re: Abstract character?

From: Markus Scherer (markus.scherer@jtcsv.com)
Date: Tue Jul 23 2002 - 12:26:32 EDT


So far, the Unicode Standard has defined code points as the contiguous range 0..0x10ffff.
Some definitions in the standard are fuzzy, and there are hopes that Unicode 4.0 will clarify them.

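It is true that UTF-16 cannot encode the code point sequence <d800 dc00>, because those two
code units would be read back as the surrogate pair for 10000. But it can encode <d800 0061 dc00>.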
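
To make the round-trip point concrete, here is a minimal sketch in Python. The
helper name round_trip is made up for this illustration. It leans on the
"surrogatepass" error handler of Python's built-in UTF-16 codec, which is
Python's own lenient behavior and not something the Unicode Standard prescribes.

```python
def round_trip(code_points):
    """Encode code points as UTF-16 code units, then decode them again."""
    text = "".join(chr(cp) for cp in code_points)
    # "surrogatepass" lets the codec write a surrogate code point as a
    # single 16-bit code unit instead of raising UnicodeEncodeError.
    data = text.encode("utf-16-be", "surrogatepass")
    # On the way back, a lead unit followed by a trail unit is decoded as
    # one supplementary code point; unpaired surrogates pass through.
    decoded = data.decode("utf-16-be", "surrogatepass")
    return [ord(ch) for ch in decoded]


# <d800 dc00> does not survive: the two units come back as 10000.
print([hex(cp) for cp in round_trip([0xD800, 0xDC00])])

# <d800 0061 dc00> survives unchanged.
print([hex(cp) for cp in round_trip([0xD800, 0x0061, 0xDC00])])
```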

There are at least three reasons not to forbid the representation
of surrogate code points in UTF-16 (and UTF-32),
or to deny that surrogates are code points:

1. Compatibility.
    UTF-16 was explicitly created to be backwards compatible with UCS-2.
    Valid UCS-2 text must be valid UTF-16 text.
    In UCS-2, code points d800..dfff were legal, so they must be legal in UTF-16 as well.

2. Performance.
    When you iterate through a UTF-16/32 string, you don't want to forbid
    surrogate code points because rejecting them adds complexity to your logic.
    In fact, iterating through UTF-16 text currently does not produce any
    decoding errors.
    When you go through <d800 0061 dc00 d800 dc01> you get code points
    d800, 0061, dc00, 10001.

    Similarly, you don't want to forbid appending d800 to a string.
    The application might be appending code units deliberately
    (with dc00 to follow), or it might simply be unaware of
    surrogates and pass code units through one by one
    (a UCS-2 application), with the reasonable hope that a surrogate pair
    will be rejoined by default.

3. Properties.
    An API that takes a code point and returns a property for that code point
    must be able to deal with surrogate code points because there are non-trivial
    properties assigned to them, e.g., general category Cs.

    Surrogate code points have been listed in the UCD for a long time,
    which shows that they are different from illegal code point values
    like 0x110000 or -1.
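
As a rough illustration of points 2 and 3, here is a short Python sketch. The
function names iter_code_points and general_category are invented for this
example; only unicodedata.category is a real library call. It is one way to
write such an iterator under these assumptions, not a reference implementation.

```python
import unicodedata


def iter_code_points(units):
    """Yield code points from 16-bit code units without ever raising."""
    i = 0
    n = len(units)
    while i < n:
        u = units[i]
        i += 1
        # A lead surrogate directly followed by a trail surrogate forms
        # one supplementary code point.
        if 0xD800 <= u <= 0xDBFF and i < n and 0xDC00 <= units[i] <= 0xDFFF:
            u = 0x10000 + ((u - 0xD800) << 10) + (units[i] - 0xDC00)
            i += 1
        # Everything else, including an unpaired surrogate, passes through.
        yield u


def general_category(cp):
    """Return the general category for any code point in 0..0x10ffff."""
    if not 0 <= cp <= 0x10FFFF:
        # Values such as 0x110000 or -1 are not code points at all.
        raise ValueError("not a code point: %r" % (cp,))
    return unicodedata.category(chr(cp))


units = [0xD800, 0x0061, 0xDC00, 0xD800, 0xDC01]
print(["%04x" % cp for cp in iter_code_points(units)])
print(general_category(0xD800))
```

The iterator has no error path, which is the simplicity argued for in point 2.
The property function separates surrogate code points, which have a category,
from values outside the code point range, which do not.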

markus


