Re: U+xxxx, U-xxxxxx, and the basics

From: Mark Davis (markdavis@ispchannel.com)
Date: Tue Mar 07 2000 - 10:34:31 EST


In the original design of Unicode, 16 bits actually was enough. That is because that design used composition, for example coding a small block of Hangul Jamo instead of coding 11,172 combinations as separate Hangul syllables. It was also intended to encode only modern scripts: archaic scripts and characters not in frequent modern use would have been handled as private use characters, interchanged by agreement. Of course, that design changed over time, both as a result of merging in the 10646 characters and in response to market needs.
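For the curious, the composition arithmetic is trivial. A rough sketch in C, using the constants of the standard Hangul algorithm (19 leading, 21 vowel, and 28 trailing jamo, which is where the 11,172 comes from):

    #include <stdio.h>

    int main(void) {
        /* Constants from the standard Hangul syllable composition. */
        const int LCount = 19, VCount = 21, TCount = 28;
        int l = 0, v = 0, t = 0;   /* illustrative jamo indices */
        unsigned int syllable = 0xAC00 + (l * VCount + v) * TCount + t;
        printf("%d syllables, first is U+%04X\n",
               LCount * VCount * TCount, syllable);   /* 11172, U+AC00 */
        return 0;
    }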

However, had Unicode been designed as a 32-bit standard from the beginning, it would have been dead in the water. Companies would not have moved to a standard that required them to quadruple their storage requirements. Doubling was quite difficult enough! UTF-16 allows us to bridge that gap; it provides the capability of encoding all the characters we will ever need, in essentially the same amount of storage as pure 16-bit text. The frequency of surrogate pairs is extremely low (actually zero right now, but very, very low in any future projections), which means the memory usage is effectively half that of UTF-32.
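Schematically, a supplementary code point just spreads across two 16-bit code units, so existing 16-bit pipelines keep working. A rough sketch in C (the code point is only an illustrative value):

    #include <stdio.h>

    int main(void) {
        unsigned int cp = 0x10384;                 /* illustrative supplementary code point */
        unsigned int u = cp - 0x10000;             /* 20 bits to split */
        unsigned int lead  = 0xD800 + (u >> 10);   /* high 10 bits     */
        unsigned int trail = 0xDC00 + (u & 0x3FF); /* low 10 bits      */
        printf("U+%04X -> %04X %04X\n", cp, lead, trail);
        return 0;
    }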

In memory, UTF-16 is a perfectly acceptable choice. While it is not as simple to process as UTF-32, it is not that difficult to handle. Many processes are transparent to whether surrogates are present or not, and the design has a number of nice features (such as non-overlap) that make it straightforward to deal with. And using it in memory halves your memory usage relative to UTF-32; even today, memory consumption has a strong link to overall performance.
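For example, because the lead surrogate range, the trail surrogate range, and ordinary BMP code units do not overlap, a single code unit tells you on its own what it is, and a scan can resynchronize from any position. A rough sketch in C:

    #include <stdio.h>

    typedef unsigned short UTF16Unit;   /* one 16-bit code unit */

    static int isLead(UTF16Unit c)  { return c >= 0xD800 && c <= 0xDBFF; }
    static int isTrail(UTF16Unit c) { return c >= 0xDC00 && c <= 0xDFFF; }

    int main(void) {
        /* 'A', then U+10000 as a surrogate pair, then a BMP Han character. */
        UTF16Unit s[] = { 0x0041, 0xD800, 0xDC00, 0x4E2D };
        int i;
        for (i = 0; i < 4; ++i) {
            if (isLead(s[i]))       printf("unit %d: lead surrogate\n", i);
            else if (isTrail(s[i])) printf("unit %d: trail surrogate\n", i);
            else                    printf("unit %d: BMP character\n", i);
        }
        return 0;
    }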

One can even use a hybrid approach in memory: UTF-16 for storage in strings, plus a UTF-32 character type for when that is more convenient, such as when looking up Unicode character properties. This is not to say that one shouldn't use pure UTF-32 in memory; that may also be a reasonable choice, depending on the circumstances.
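A rough sketch of that hybrid idea in C (the type names are only illustrative, not any particular library's): strings stay UTF-16, and a 32-bit code point is produced on demand, say just before a property lookup.

    #include <stdio.h>

    typedef unsigned short UTF16Unit;   /* 16-bit storage unit   */
    typedef unsigned int   CodePoint32; /* 32-bit character type */

    /* Return the code point starting at s[i]; assumes well-formed UTF-16. */
    static CodePoint32 codePointAt(const UTF16Unit *s, int i) {
        CodePoint32 c = s[i];
        if (c >= 0xD800 && c <= 0xDBFF) {   /* lead surrogate: combine with trail */
            c = 0x10000 + ((c - 0xD800) << 10) + (s[i + 1] - 0xDC00);
        }
        return c;
    }

    int main(void) {
        UTF16Unit s[] = { 0x0041, 0xD800, 0xDC00 };   /* 'A', then U+10000 */
        printf("U+%04X U+%04X\n", codePointAt(s, 0), codePointAt(s, 1));
        return 0;
    }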

If we were to do it all over again, the only thing I would change about UTF-16 would be to put the surrogate blocks at the very end, and make the range go up only to FFFFF. However, that is water, way, way under the bridge. I wrote a bit more on this in http://www-4.ibm.com/software/developer/library/utfencodingforms (I seem to have mentioned that paper a lot lately, but I'd rather link to it than repeat myself in email.)

Mark

Dan Oscarsson wrote:

> >UCS-2 is bogus because it isn't UTF-16. New implementations should not use
> >UCS-2, since UTF-16 is a superset that allows for the surrogate characters.
> >Supporting only UCS-2 will mean that your implementation breaks when Unicode
> >3+ characters become official and get used (which will happen quickly
> >because there are a bunch of additional Han characters in the first plane).
> >In other words: when mentioning and describing UCS-2, deprecate its use
> >clearly so that newcomers understand that they *need to* support UTF-16.
>
> UTF-16 can be good for storage in files. When doing work on characters,
> UCS-4 should be better, as you do not have to decode code points on the
> fly.
>
> >
> >UTF-16 is also a requirement because there are a number of significant UCS-2
> >implementations by now that need to support additional characters and
> >re-architecting them is not an option, compared to providing a mechanism
> >like UTF-16 to make them conformant. Oh, I forgot, we should replumb them
> >all to use UTF-32 ;-)......
> >
>
> Yes, I am sure UTF-16 only exists to help the software that expected
> 16 bits to be enough (that is one major mistake Unicode made).
> If Unicode had started like UCS with 32 (31) bits, all would have been well.
> UTF-32 is just another name for UCS-4 with the restriction that you do not
> use the code points above the current Unicode upper limit.
>
> There is no need to deprecate UCS-2; depending on the subset of UCS
> you want to support, UCS-2 might be better.
>
> As Unicode has now recognised that 16 bits is not enough, there is
> no need to limit ourselves to fewer than 31 bits. After all, to handle
> code points longer than 16 bits, 32 bits is the most common
> length easily available.
>
> Dan


