How to encode Hex10FFFF characters with UTF-16??

Date: Thu Mar 16 2006 - 09:05:51 CST


    Hi! Here is a small discussion I had privately.

     I have trouble understanding how it is possible to encode
     Hex10FFFF characters with UTF-16. When I try to calculate the range of
     UTF-16, I always get a maximum of Hex10F7FF.


     (DBFF - D7FF) * (DFFF - DBFF) +    D7FF     + (FFFF - DFFF)
     (High Surr.)     (Low Surr.)    (0 to D7FF)   (E000 to FFFF)

     Please tell me how to encode Hex10FFFF characters.




    Your formula is right, and so is Ken. There are 1024 x 1024 = 1048576
    code points accessible by surrogates, plus another 65536 in the BMP,
    from which you have to subtract the 2048 surrogate code points. These
    are permanently reserved because of their use in UTF-16.

    Doug Ewell
    Fullerton, California, USA
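
    The count described in the reply above can be checked with a few lines
    of Python. This is only a sketch of the arithmetic; the variable names
    are made up for illustration and do not come from the thread. It shows
    why the number of encodable values (hex 10F800) is smaller than the
    highest code point (hex 10FFFF): the surrogate range is a hole in the
    numbering, not a shift of the upper limit.

```python
# Sizes of the two surrogate ranges (names are illustrative assumptions).
HIGH_SURROGATES = 0xDBFF - 0xD800 + 1   # D800..DBFF -> 1024
LOW_SURROGATES = 0xDFFF - 0xDC00 + 1    # DC00..DFFF -> 1024
BMP_SIZE = 0x10000                      # 0000..FFFF -> 65536
SURROGATE_CODE_POINTS = HIGH_SURROGATES + LOW_SURROGATES  # 2048

# Every (high, low) combination addresses one supplementary code point.
supplementary = HIGH_SURROGATES * LOW_SURROGATES          # 1048576

# Values that can actually be encoded: surrogate pairs plus the BMP,
# minus the surrogates themselves, which are reserved.
encodable = supplementary + BMP_SIZE - SURROGATE_CODE_POINTS

# The supplementary code points start at 10000, so the last one is:
highest_code_point = 0x10000 + supplementary - 1

print(encodable, hex(encodable))        # count of encodable values
print(hex(highest_code_point))          # highest code point
```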

    Thank you very much for your response. I already thought I was being dumb.
    So in the Unicode charts all characters above FFFF are double-coded, by
    themselves and by the surrogate pairs.
    Can you also use the surrogate pairs in UTF-32?

    No, nothing is double-coded. Each code point is uniquely identified by 
    a single Unicode Scalar Value, including those beyond FFFF. When using 
    UTF-16, they are encoded with a surrogate pair, while when using UTF-32, 
    they are encoded as a single 32-bit value.
    Take, for example, the character U+10000 LINEAR B SYLLABLE B008 A (𐀀).
    This is encoded as follows:
    UTF-8: F0 90 80 80
    UTF-16: D800 DC00
    UTF-32: 00010000
    It is an error to use the surrogate pairs in UTF-32, that is, to encode 
    the Linear B character above as 0000D800 0000DC00. (And, of course, it 
    is impossible to encode the hex value 10000 directly in a 16-bit word.)
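
    The three encodings listed above can be reproduced with Python's
    built-in codecs, and the surrogate pair can be computed by hand. The
    helper names to_surrogate_pair and utf32_rejects_surrogates are
    hypothetical, chosen for this sketch only; the last check assumes
    Python 3.4 or later, where the UTF-32 decoder refuses surrogate values.

```python
def to_surrogate_pair(code_point):
    """Split a code point in U+10000..U+10FFFF into two UTF-16 code units."""
    if not 0x10000 <= code_point <= 0x10FFFF:
        raise ValueError("not a supplementary code point")
    offset = code_point - 0x10000        # 20-bit value
    high = 0xD800 + (offset >> 10)       # top 10 bits
    low = 0xDC00 + (offset & 0x3FF)      # bottom 10 bits
    return high, low


def utf32_rejects_surrogates():
    """Return True if 0000D800 0000DC00 fails to decode as UTF-32."""
    try:
        bytes.fromhex("0000d8000000dc00").decode("utf-32-be")
    except UnicodeDecodeError:
        return True
    return False


ch = chr(0x10000)                        # LINEAR B SYLLABLE B008 A
utf8 = ch.encode("utf-8").hex()          # bytes of the UTF-8 form
utf16 = ch.encode("utf-16-be").hex()     # big-endian, no byte order mark
utf32 = ch.encode("utf-32-be").hex()     # big-endian, no byte order mark

first_pair = to_surrogate_pair(0x10000)  # lowest supplementary code point
last_pair = to_surrogate_pair(0x10FFFF)  # highest code point overall

print(utf8, utf16, utf32)
print([hex(unit) for unit in first_pair], [hex(unit) for unit in last_pair])
print(utf32_rejects_surrogates())
```

    The big-endian codec names are used so that no byte order mark is
    prepended and the output matches the hex values quoted in the message.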
    The practice of describing Unicode code points above FFFF in terms of 
    their surrogate pairs, instead of by the scalar value, dates back to 
    earlier years, when UTF-16 was considered the standard form of Unicode 
    and all others were considered "transformations."
    Please feel free to ask these questions on the list instead of 
    privately. I wanted to post this answer on the list, but that would 
    have been a violation of netiquette since your message was private.
    Doug Ewell
    Fullerton, California, USA

    This archive was generated by hypermail 2.1.5 : Thu Mar 16 2006 - 15:01:11 CST