# How to encode Hex10FFFF characters with UTF-16??

From: Kornkreismuster@web.de
Date: Thu Mar 16 2006 - 09:05:51 CST


Hi! Here is a small discussion I had privately.

I have trouble understanding how it is possible to encode Hex10FFFF
characters with UTF-16. Whenever I calculate the range of UTF-16, I
get a maximum of Hex10F7FF.

Calculation:

(DBFF - D7FF) * (DFFF - DBFF) + D7FF + (FFFF - DFFF)
 (High Surr.)    (Low Surr.)  (0 to D7FF) (E000 to FFFF)
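
A quick check of this arithmetic in Python (the variable names are
mine, just for illustration):

```python
# Evaluate the formula above term by term.
high_surrogates = 0xDBFF - 0xD7FF   # 0x400 = 1024 values (D800..DBFF)
low_surrogates  = 0xDFFF - 0xDBFF   # 0x400 = 1024 values (DC00..DFFF)
below_gap       = 0xD7FF            # code points 0000..D7FF
above_gap       = 0xFFFF - 0xDFFF   # 0x2000 values (E000..FFFF)

total = high_surrogates * low_surrogates + below_gap + above_gap
print(hex(total))  # 0x10f7ff -- the maximum I keep arriving at
```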

Please tell me how to encode Hex10FFFF characters.

Regards,

KKM

********************************************************

Your formula is right, and so is Ken. There are 1024 x 1024 = 1048576
code points accessible by surrogates, plus another 65536 in the BMP,
but you have to subtract the 2048 surrogate code points, which are
permanently reserved because of their use in UTF-16. Since those
reserved values sit in the middle of the BMP, not at the top, the
highest encodable code point is still U+10FFFF.
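
The same sum, spelled out in Python for comparison with the formula
above:

```python
# Supplementary code points reachable with surrogate pairs, plus the
# BMP, minus the 2048 reserved surrogate values.
total = 1024 * 1024 + 65536 - 2048
print(total, hex(total))  # 1112064 0x10f800 scalar values, U+0000..U+10FFFF
```

This is one more than the 0x10F7FF above, because the formula counts
the range 0000..D7FF as D7FF values when the inclusive count is
actually D800.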

--
Doug Ewell
Fullerton, California, USA
********************************************************
Hi!
So in the Unicode charts, all characters above FFFF are double-coded,
both by themselves and by their surrogate pairs.
Can you also use the surrogate-pairs in UTF-32?
Regards,
KKM
********************************************************
KKM,
No, nothing is double-coded. Each code point, including those beyond
FFFF, is uniquely identified by a single Unicode Scalar Value. In
UTF-16 such a code point is encoded as a surrogate pair; in UTF-32 it
is encoded as a single 32-bit value.
Take, for example, the character U+10000 LINEAR B SYLLABLE B008 A (𐀀).
This is encoded as follows:
UTF-8: F0 90 80 80
UTF-16: D800 DC00
UTF-32: 00010000
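The UTF-16 form follows mechanically from the surrogate-pair
arithmetic defined in the Unicode Standard; here is a minimal Python
sketch (the function name is mine):

```python
# Surrogate-pair encoding: subtract 0x10000, then split the remaining
# 20 bits into two 10-bit halves mapped onto the surrogate ranges.
def to_surrogate_pair(cp: int) -> tuple[int, int]:
    assert 0x10000 <= cp <= 0x10FFFF
    offset = cp - 0x10000
    high = 0xD800 + (offset >> 10)     # top 10 bits    -> D800..DBFF
    low  = 0xDC00 + (offset & 0x3FF)   # bottom 10 bits -> DC00..DFFF
    return high, low

print([hex(u) for u in to_surrogate_pair(0x10000)])   # ['0xd800', '0xdc00']
print([hex(u) for u in to_surrogate_pair(0x10FFFF)])  # ['0xdbff', '0xdfff']
```

As the second call shows, the highest code point U+10FFFF maps to the
pair DBFF DFFF, which is exactly where the surrogate mechanism tops
out.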
It is an error to use the surrogate pairs in UTF-32, that is, to encode
the Linear B character above as 0000D800 0000DC00. (And, of course, it
is impossible to encode the hex value 10000 directly in a 16-bit word.)
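A sketch of the corresponding validity check (again in Python, names
illustrative):

```python
# A UTF-32 code unit must be a Unicode scalar value: surrogate values
# (D800..DFFF) and anything above 10FFFF are ill-formed.
def is_valid_utf32_unit(u: int) -> bool:
    return 0 <= u <= 0x10FFFF and not (0xD800 <= u <= 0xDFFF)

print(is_valid_utf32_unit(0x10000))  # True: encode Linear B directly
print(is_valid_utf32_unit(0xD800))   # False: surrogates are not characters
```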
The practice of describing Unicode code points above FFFF in terms of
their surrogate pairs, instead of by the scalar value, dates back to
earlier years, when UTF-16 was considered the standard form of Unicode
and all others were considered "transformations."