Re: U+xxxx, U-xxxxxx, and the basics

From: Dan Oscarsson (Dan.Oscarsson@trab.se)
Date: Tue Mar 07 2000 - 03:10:43 EST


>UCS-2 is bogus because it isn't UTF-16. New implementations should not use
>UCS-2, since UTF-16 is a superset that allows for the surrogate characters.
>Supporting only UCS-2 will mean that your implementation breaks when Unicode
>3+ characters become official and get used (which will happen quickly
>because there are a bunch of additional Han characters in the first plane).
>In other words: when mentioning and describing UCS-2, deprecate its use
>clearly so that newcomers understand that they *need to* support UTF-16.

UTF-16 can be good for storing text in files. When processing characters,
UCS-4 should be better, as you do not have to decode code points on the
fly.
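
To illustrate the on-the-fly decoding mentioned above, here is a minimal
sketch in Python. The function name decode_utf16_units is made up for this
example, and the input is assumed to be a list of 16-bit code units given as
integers. An unpaired surrogate is passed through unchanged, which is a
simplification and not what a strict decoder would do.

```python
def decode_utf16_units(units):
    """Turn a sequence of 16-bit code units into a list of code points.

    Hypothetical helper for illustration only. Unpaired surrogates are
    copied through as-is instead of being reported as errors.
    """
    code_points = []
    i = 0
    while i < len(units):
        unit = units[i]
        is_high = 0xD800 <= unit <= 0xDBFF
        has_low = i + 1 < len(units) and 0xDC00 <= units[i + 1] <= 0xDFFF
        if is_high and has_low:
            # Combine the high and low surrogate into one code point.
            low = units[i + 1]
            code_points.append(0x10000 + ((unit - 0xD800) << 10) + (low - 0xDC00))
            i += 2
        else:
            # A BMP code unit is already a complete code point.
            code_points.append(unit)
            i += 1
    return code_points


# With UCS-4 the n-th character is simply element n of the array.
# With UTF-16 the array has to be scanned, because some characters
# occupy two code units.
units = [0x0041, 0xD840, 0xDC00, 0x00E5]
print([hex(cp) for cp in decode_utf16_units(units)])
```

The pair 0xD840 0xDC00 decodes to U+20000, so four code units become three
characters, and indexing by code unit no longer matches indexing by
character.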

>
>UTF-16 is also a requirement because there are a number of significant UCS-2
>implementations by now that need to support additional characters and
>re-architecting them is not an option, compared to providing a mechanism
>like UTF-16 to make them conformant. Oh, I forgot, we should replumb them
>all to use UTF-32 ;-)......
>

Yes, I am sure UTF-16 only exists to help the software that assumed
16 bits were enough (that is one major mistake Unicode made).
If Unicode had started like UCS, with 32 (31) bits, all would have been well.
UTF-32 is just another name for UCS-4 with the restriction that you do not
use the code points above the current Unicode upper limit.
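
The relation between the two ranges can be sketched as a pair of checks in
Python. The constant and function names are made up for this example. The
limits themselves are 0x7FFFFFFF for the 31-bit UCS-4 space and 0x10FFFF for
the Unicode upper limit. Only the upper bound is checked here, so the
surrogate range D800..DFFF is not treated specially in this sketch.

```python
UCS4_MAX = 0x7FFFFFFF      # 31-bit UCS-4 code space
UNICODE_MAX = 0x10FFFF     # current Unicode upper limit


def in_ucs4_range(value):
    """True if value fits in the 31-bit UCS-4 code space."""
    return 0 <= value <= UCS4_MAX


def in_utf32_range(value):
    """True if value is a UCS-4 value that UTF-32 also allows.

    Illustration only: checks the upper limit and nothing else.
    """
    return 0 <= value <= UNICODE_MAX


for value in (0x41, 0x10FFFF, 0x110000, 0x7FFFFFFF):
    print(hex(value), in_ucs4_range(value), in_utf32_range(value))
```

Every value accepted by in_utf32_range is also accepted by in_ucs4_range,
but not the other way around.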

There is no need to deprecate UCS-2. Depending on the subset of UCS
you want to support, UCS-2 might be the better choice.
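
Whether a given text stays inside the subset that UCS-2 can represent is
easy to test. A minimal sketch in Python follows. The function name
fits_in_ucs2 is made up for this example, and it relies on Python 3 strings
being sequences of code points.

```python
def fits_in_ucs2(text):
    """True if every character of text lies in the 16-bit range.

    Hypothetical helper: such text needs no surrogate pairs, so a
    UCS-2 implementation can hold it.
    """
    return all(ord(ch) <= 0xFFFF for ch in text)


print(fits_in_ucs2("Dan Oscarsson"))   # Latin text, all in the 16-bit range
print(fits_in_ucs2("\U00020000"))      # a code point beyond 16 bits
```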

As Unicode has now recognised that 16 bits are not enough, there is
no need to limit ourselves to less than 31 bits. After all, to handle
code points that need more than 16 bits, 32 bits is the most common
word size easily available.

   Dan


