Re: UCS-2

From: Doug Ewell (dewell@compuserve.com)
Date: Thu Apr 27 2000 - 10:11:16 EDT


Samir Mehrotra <Samir.Mehrotra@mail.iflexsolutions.com> wrote:

> I have a doubt about the storage patterns of the different
> character sets of Unicode, especially UCS-2. UTF-8 can store up to six
> bytes per character (however, the maximum for any character as of now is
> 3 bytes...),

It is expected that no characters will ever be assigned in Unicode that
require the five-byte and six-byte UTF-8 forms, so you can consider the
maximum for UTF-8 to be four bytes.
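
As a rough illustration of those lengths, here is a minimal Python
sketch. The sample code points are chosen only for illustration, and
U+10000 is used simply as the first code point above U+FFFF.

```python
# Sample code points, one per UTF-8 length class (illustrative choices).
samples = {
    "U+0041": "\u0041",        # Basic Latin letter A
    "U+00E9": "\u00e9",        # Latin small letter e with acute
    "U+4E2D": "\u4e2d",        # a CJK ideograph
    "U+10000": "\U00010000",   # first code point above U+FFFF
}

# Number of bytes each sample occupies when encoded as UTF-8.
utf8_lengths = {name: len(ch.encode("utf-8")) for name, ch in samples.items()}

for name, length in utf8_lengths.items():
    print(name, "->", length, "byte(s) in UTF-8")
```

Nothing in the Unicode range needs more than the four-byte form; the
five-byte and six-byte forms would only cover values beyond U+10FFFF.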

> What about UCS-2? Does this character set require two bytes for each
> and every character that is encoded in it, or does the number of bytes
> vary from character to character, i.e., a character from English
> requires a single byte whereas a character from CJKV requires 2
> bytes?

First, you should think in terms of UTF-16, not UCS-2, because UCS-2
does not allow the use of surrogates to encode those characters above
U+FFFF (really U+FFFD) that will be assigned in the near future.

Second, every character in UCS-2 requires two bytes regardless of code
point. In UTF-16, surrogate pairs (as mentioned above) require 4 bytes.
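
The two-byte versus four-byte split can be sketched in Python as
follows. The helper names utf16_byte_count and utf16_surrogates are
hypothetical, introduced here only for illustration, and big-endian
UTF-16 is assumed so that no byte order mark is counted.

```python
def utf16_byte_count(ch: str) -> int:
    """Bytes needed for one character in UTF-16 (big-endian, no BOM)."""
    return len(ch.encode("utf-16-be"))


def utf16_surrogates(code_point: int) -> tuple[int, int]:
    """Split a code point above U+FFFF into a (high, low) surrogate pair."""
    if not 0x10000 <= code_point <= 0x10FFFF:
        raise ValueError("only code points U+10000..U+10FFFF use surrogates")
    offset = code_point - 0x10000           # 20-bit value
    high = 0xD800 + (offset >> 10)          # top 10 bits
    low = 0xDC00 + (offset & 0x3FF)         # bottom 10 bits
    return high, low


# English letter and CJK ideograph: both take one 16-bit unit (2 bytes).
print(utf16_byte_count("A"), utf16_byte_count("\u4e2d"))

# A code point above U+FFFF takes two 16-bit units (4 bytes).
print(utf16_byte_count("\U00010000"))
print([hex(unit) for unit in utf16_surrogates(0x10000)])
```

Note that an English letter and a CJK ideograph both take two bytes
here; only code points above U+FFFF take four.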

-Doug Ewell
 Fullerton, California


