Re: Subj: How to encode Hex10FFFF characters with UTF-16??

From: Philippe Verdy (verdy_p@wanadoo.fr)
Date: Fri Mar 10 2006 - 21:30:18 CST

    > Date/Time: Fri Mar 10 15:14:25 CST 2006
    > Contact: Kornkreismuster@web.de
    > Name: Kornkreismuster
    > Report Type: Error Report
    > Opt Subject: How to encode Hex10FFFF characters with UTF-16??
    >
    > Hi!
    >
    > I've got a problem to understand how it is possible to encode Hex10FFFF characters with UTF-16. If I try to calculate the range of UTF-16 I always get a maximum number of Hex10F7FF.
    >
    > Calculation:
    >
    > (DBFF - D7FF) * (DFFF - DBFF) + D7FF + FFFF - DFFF
    > (High Surr.) (Low Surr.) (0 to D7FF) (D800 to FFFF)
    >
    > Please tell me how to encode Hex10FFFF characters.

    Your formula is wrong. I don't know why you needed to invent it, given that the Unicode standard fully defines this.

    You just need to divide the supplementary code point into two parts:
    * take the lowest 10 bits and add the constant 0xDC00 to create the second (low) surrogate
    * shift out the lowest 10 bits, subtract the constant 0x40 (this is the value you get by shifting the first supplementary code point, U+10000, right by 10 bits) and add the constant 0xD800 to create the first (high) surrogate.
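
    The two steps above can be sketched in Python as follows. This is only an illustration: the function names to_surrogate_pair and from_surrogate_pair are made up for this example and do not come from the standard.

```python
from typing import Tuple


def to_surrogate_pair(cp: int) -> Tuple[int, int]:
    """Split a supplementary code point (U+10000..U+10FFFF) into (high, low)."""
    if not 0x10000 <= cp <= 0x10FFFF:
        raise ValueError(f"U+{cp:04X} is not a supplementary code point")
    # Lowest 10 bits, plus 0xDC00, give the second (low) surrogate.
    low = 0xDC00 + (cp & 0x3FF)
    # Remaining bits, minus 0x40 (= 0x10000 >> 10), plus 0xD800,
    # give the first (high) surrogate.
    high = 0xD800 + ((cp >> 10) - 0x40)
    return high, low


def from_surrogate_pair(high: int, low: int) -> int:
    """Inverse operation: rebuild the code point from a surrogate pair."""
    if not (0xD800 <= high <= 0xDBFF and 0xDC00 <= low <= 0xDFFF):
        raise ValueError("not a valid surrogate pair")
    return (((high - 0xD800) + 0x40) << 10) | (low - 0xDC00)


if __name__ == "__main__":
    for cp in (0x10000, 0x1D11E, 0x10FFFF):
        high, low = to_surrogate_pair(cp)
        print(f"U+{cp:06X} -> {high:04X} {low:04X}")
```

    The last supplementary code point, U+10FFFF, maps to the pair DBFF DFFF, i.e. the last high surrogate followed by the last low surrogate.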

    There are 0x400 high surrogates and 0x400 low surrogates (each one carrying 10 distinct bits), so their combinations allow encoding 0x400*0x400 = 0x100000 supplementary code points, in addition to the 0x10000 values in the BMP (if you include the surrogate values, which are not valid code points as they can't be represented in UTF-16).

    So UTF-16 can encode a total range of 0x110000 values, i.e. exactly 17 planes, from U+0000 to U+10FFFF. Among them, of course, you must exclude all surrogates (U+D800 to U+DFFF do not exist as valid code points).
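
    As a quick check of this arithmetic, here is a small Python sketch (the variable names are only illustrative). It also evaluates the formula from the question: as far as I can tell, that formula yields 0x10F7FF because it counts encodable values (and is off by one for the range 0 to D7FF) instead of giving the highest code point, which is higher because the surrogate range leaves a hole in the numbering.

```python
HIGH_SURROGATES = range(0xD800, 0xDC00)  # 0x400 values
LOW_SURROGATES = range(0xDC00, 0xE000)   # 0x400 values

bmp = 0x10000                                               # plane 0, surrogates included
supplementary = len(HIGH_SURROGATES) * len(LOW_SURROGATES)  # 0x100000
total_range = bmp + supplementary                           # 0x110000 = 17 planes
highest_code_point = total_range - 1                        # 0x10FFFF

# Excluding the surrogates themselves leaves the values that are really encodable.
surrogates = len(HIGH_SURROGATES) + len(LOW_SURROGATES)     # 0x800
encodable = total_range - surrogates                        # 0x10F800

# The formula from the question, evaluated as written.
question_formula = (
    (0xDBFF - 0xD7FF) * (0xDFFF - 0xDBFF) + 0xD7FF + 0xFFFF - 0xDFFF
)

if __name__ == "__main__":
    print(hex(total_range), hex(highest_code_point))
    print(hex(encodable), hex(question_formula))
```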

    All standard UTFs cover exactly the same set of valid code points. However, some of these valid code points are permanently assigned to "non-characters", and must never be used to encode conforming texts (this is the case, for example, for the last 2 code points in each of the 17 planes).

    This restriction is what allows using "byte order marks" in some encoding schemes (but not in any encoding form!), with the convention of prefixing the text encoded on a byte stream with a constant valid character that almost never has any meaning at the beginning of a text (U+FEFF, which is a valid code point assigned to the ZWNBSP character): if you reverse the bytes of this character, you get U+FFFE, which is a valid code point but not a valid character.
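
    To illustrate the byte reversal, here is a minimal Python sketch. The helper names are invented for this example, and it only looks at the UTF-16 case (it makes no attempt to tell a UTF-16 byte order mark apart from a UTF-32 one).

```python
BOM = "\uFEFF"


def swap_bytes(code_unit: int) -> int:
    """Reverse the two bytes of a 16-bit code unit."""
    return int.from_bytes(code_unit.to_bytes(2, "big"), "little")


def detect_utf16_byte_order(data: bytes) -> str:
    """Guess the byte order of a UTF-16 byte stream from its first two bytes.

    Returns "big", "little", or "unknown" when no byte order mark is present.
    """
    head = data[:2]
    if head == b"\xFE\xFF":   # U+FEFF serialized high byte first
        return "big"
    if head == b"\xFF\xFE":   # U+FEFF serialized low byte first
        return "little"
    return "unknown"


if __name__ == "__main__":
    print(f"{swap_bytes(0xFEFF):04X}")
    print(detect_utf16_byte_order((BOM + "Hi").encode("utf-16-be")))
    print(detect_utf16_byte_order((BOM + "Hi").encode("utf-16-le")))
```

    Since U+FFFE can never appear as a character in conforming text, a decoder that reads FFFE as the first code unit can conclude that it is using the wrong byte order.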

    There are 32 other non-characters in the block of Arabic presentation forms (for legacy reasons): they are also valid code points, but not valid characters.
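
    Both groups of non-characters can be tested with a few lines of Python. This is a sketch only: is_noncharacter is a hypothetical helper name, and the range U+FDD0 to U+FDEF used for the 32 non-characters in the Arabic presentation forms block is filled in from the standard rather than from the text above.

```python
def is_noncharacter(cp: int) -> bool:
    """Return True if cp is a valid code point permanently reserved as a non-character."""
    if not 0 <= cp <= 0x10FFFF:
        raise ValueError("not a code point")
    # Last two code points of each of the 17 planes: U+nFFFE and U+nFFFF.
    if (cp & 0xFFFE) == 0xFFFE:
        return True
    # The 32 non-characters in the Arabic presentation forms block.
    return 0xFDD0 <= cp <= 0xFDEF


if __name__ == "__main__":
    total = sum(is_noncharacter(cp) for cp in range(0x110000))
    print(total)  # 17 * 2 + 32
```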

    Note that all other code points currently not assigned to characters are NOT invalid: they are only reserved for later assignments, so no one should produce documents containing them before the characters are officially mapped in the ISO/IEC 10646 and Unicode standards. Applications must even assume that they encode valid characters, even if the character is not known at the time the application is written (that's why Unicode defines some default character properties, as this preserves the compatibility of applications with future versions of the ISO/IEC 10646 standard). So applications should handle them as if they were encoding unknown graphic symbols with weak directionality, except in one area where they should be treated by default as right-to-left graphic symbols.


