How to encode Hex10FFFF characters with UTF-16??

From: Kornkreismuster@web.de
Date: Thu Mar 16 2006 - 09:05:51 CST

Next message: Mike Ayers: "Re: How to encode Hex10FFFF characters with UTF-16??"

Previous message: Rick McGowan: "Re: Representative glyphs for combining kannada signs"
Next in thread: Mike Ayers: "Re: How to encode Hex10FFFF characters with UTF-16??"
Reply: Mike Ayers: "Re: How to encode Hex10FFFF characters with UTF-16??"

Hi! Here is a small discussion I had privately.

I have trouble understanding how it is possible to encode
Hex10FFFF characters with UTF-16. When I try to calculate the range of
UTF-16, I always get a maximum of Hex10F7FF.

Calculation:

(DBFF - D7FF) * (DFFF - DBFF) + D7FF        + FFFF - DFFF
(High Surr.)    (Low Surr.)     (0 to D7FF)   (E000 to FFFF)

Please tell me how to encode Hex10FFFF characters.

Regards,

KKM

********************************************************

Your formula is right, and so is Ken. There are 1024 x 1024 = 1048576
code points accessible by surrogates, plus another 65536 in the BMP, but
you have to subtract the 2048 surrogate code points. These are
permanently reserved because of their use in UTF-16.
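
The arithmetic above can be checked with a short Python sketch. It is
only an illustration: the variable names are made up for this example,
and the last lines are one reading of why the original calculation came
out as hex 10F7FF (the term "D7FF" counts the range 0 to D7FF without
counting code point 0 itself).

```python
# Surrogate ranges as half-open Python ranges.
HIGH_SURROGATES = range(0xD800, 0xDC00)  # D800..DBFF, 1024 values
LOW_SURROGATES = range(0xDC00, 0xE000)   # DC00..DFFF, 1024 values

# Code points reachable through surrogate pairs: 1024 x 1024.
supplementary = len(HIGH_SURROGATES) * len(LOW_SURROGATES)

# Code points in the BMP (0000..FFFF), surrogates included.
bmp = 0x10000

# Surrogate code points, which never stand for characters themselves.
surrogates = len(HIGH_SURROGATES) + len(LOW_SURROGATES)

# Number of values that can actually be encoded (Unicode scalar values).
scalar_values = supplementary + bmp - surrogates

# Highest code point number: the numbering does not skip the surrogates.
highest_code_point = bmp + supplementary - 1

# The formula from the question, evaluated literally.
question_formula = (0xDBFF - 0xD7FF) * (0xDFFF - 0xDBFF) + 0xD7FF + 0xFFFF - 0xDFFF

print(f"{scalar_values} = {scalar_values:X} hex")    # 1112064 = 10F800 hex
print(f"{highest_code_point:X}")                     # 10FFFF
print(f"{question_formula:X}")                       # 10F7FF
```

So hex 10FFFF is the number of the last code point, while the count of
encodable values is smaller by the 2048 reserved surrogates.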

--
Doug Ewell
Fullerton, California, USA
http://users.adelphia.net/~dewell/
********************************************************
Hi!

Thank you very much for your response. I already thought I was being dumb.
So in the Unicode charts, all characters above FFFF are double-coded: by
themselves and by the surrogate pairs.

Can you also use the surrogate pairs in UTF-32?

Regards,
KKM
********************************************************
KKM,

No, nothing is double-coded. Each code point is uniquely identified by
a single Unicode Scalar Value, including those beyond FFFF. When using
UTF-16, they are encoded with a surrogate pair, while when using UTF-32,
they are encoded as a single 32-bit value.

Take, for example, the character U+10000 LINEAR B SYLLABLE B008 A (𐀀).
This is encoded as follows:

UTF-8: F0 90 80 80
UTF-16: D800 DC00
UTF-32: 00010000

It is an error to use the surrogate pairs in UTF-32, that is, to encode
the Linear B character above as 0000D800 0000DC00. (And, of course, it
is impossible to encode the hex value 10000 directly in a 16-bit word.)
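
As a rough illustration of how the surrogate pair for the example above
is obtained, here is a minimal Python sketch of the UTF-16 mapping for
code points beyond FFFF. The function names to_surrogate_pair and
from_surrogate_pair are made up for this example and are not part of
any library; the final lines compare the result with Python's built-in
codecs.

```python
def to_surrogate_pair(code_point):
    """Split a code point in 10000..10FFFF into (high, low) surrogates."""
    if not 0x10000 <= code_point <= 0x10FFFF:
        raise ValueError("code point is not in the range 10000..10FFFF")
    offset = code_point - 0x10000        # 20-bit value, 00000..FFFFF
    high = 0xD800 + (offset >> 10)       # top 10 bits    -> D800..DBFF
    low = 0xDC00 + (offset & 0x3FF)      # bottom 10 bits -> DC00..DFFF
    return high, low


def from_surrogate_pair(high, low):
    """Combine a (high, low) surrogate pair back into one code point."""
    if not (0xD800 <= high <= 0xDBFF and 0xDC00 <= low <= 0xDFFF):
        raise ValueError("not a valid surrogate pair")
    return 0x10000 + ((high - 0xD800) << 10) + (low - 0xDC00)


high, low = to_surrogate_pair(0x10000)
print(f"{high:04X} {low:04X}")           # D800 DC00

high, low = to_surrogate_pair(0x10FFFF)
print(f"{high:04X} {low:04X}")           # DBFF DFFF

# Cross-check against the built-in encoders (big-endian, no BOM).
ch = "\U00010000"
print(ch.encode("utf-8").hex())          # f0908080
print(ch.encode("utf-16-be").hex())      # d800dc00
print(ch.encode("utf-32-be").hex())      # 00010000
```

The last surrogate pair, DBFF DFFF, maps to 10FFFF, which is why UTF-16
reaches exactly that code point and no further.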

The practice of describing Unicode code points above FFFF in terms of
their surrogate pairs, instead of by the scalar value, dates back to
earlier years, when UTF-16 was considered the standard form of Unicode
and all others were considered "transformations."

Please feel free to ask these questions on the list instead of
privately. I wanted to post this answer on the list, but that would
have been a violation of netiquette since your message was private.

--
Doug Ewell
Fullerton, California, USA
http://users.adelphia.net/~dewell/


This archive was generated by hypermail 2.1.5 : Thu Mar 16 2006 - 15:01:11 CST