Re: Rationale for U+10FFFF?

From: Markus Scherer (markus.scherer@jtcsv.com)
Date: Mon Mar 06 2000 - 17:40:32 EST


Harald Tveit Alvestrand wrote:
>
> At 10:35 06.03.00 -0800, Markus Scherer wrote:
>
> >UTF-16, therefore, does not need any range checks - either you have a BMP
> >code point or put it together from a surrogate pair.
>
> except if the second character isn't a surrogate, in which case you have an
> erroneously encoded string. Range check required.
>

Yes: for each code unit, you need to check the range to be safe. The point is that with the UTF-16 design and the maximum code point value, every code point computed from a well-formed UTF-16 sequence lies in the range 0..10FFFF and does not need to be checked further. Of course, you need to handle illegal sequences with both UTF-16 and UTF-8 (and you end up checking more code units with UTF-8).
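
To make the UTF-16 point concrete, here is a minimal sketch in C. The function name utf16_next and its return convention (-1 for an unpaired surrogate) are made up for illustration and are not taken from any library. It checks each code unit's range, but never has to check the computed code point.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper, not from any library: decodes one code point
 * starting at s[*i], advances *i, and returns -1 for an unpaired
 * surrogate. The caller must guarantee *i < len. */
int32_t utf16_next(const uint16_t *s, size_t len, size_t *i)
{
    uint16_t lead = s[(*i)++];

    if (lead < 0xD800 || lead > 0xDFFF) {
        return lead;                   /* BMP code point, nothing more to check */
    }
    if (lead <= 0xDBFF && *i < len) {  /* lead surrogate with a unit after it */
        uint16_t trail = s[*i];
        if (trail >= 0xDC00 && trail <= 0xDFFF) {
            ++*i;
            /* The result is always in 0x10000..0x10FFFF by construction. */
            return 0x10000 + (((int32_t)lead - 0xD800) << 10) + (trail - 0xDC00);
        }
    }
    return -1;                         /* unpaired surrogate: illegal sequence */
}
```

The largest pair, DBFF DFFF, yields exactly 10FFFF, and the smallest, D800 DC00, yields 10000, so a single unit and a pair can never produce the same value.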

> > UTF-8, on the other hand, comes with the problem that for a single code
> > point you have (almost always) multiple encodings,
>
> how?
> with the rule in force that no unnecessary bytes be used for encoding, I
> can't see a way to make multiple UTF-8 encodings of the same string.
>
> There exist non-valid octet strings that an UTF-8 decoder that did no range
> checks might turn into a number without an error message, but that's hardly
> a strange thing.
>

"...with the rule in force..." is just the problem. If you want to be safe, then you check for it. I do have "unsafe" macros for all UTFs that do not check anything beyond the decoding of correct and regular sequences. You cannot always use them, though.
Some people are concerned about a NUL character being encoded as an irregular C0 80 sequence, for example. Also, string comparisons are often done on the code units, without the code point decoding logic.
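
As an illustration of the C0 80 concern, the following C sketch contrasts an unchecked 2-byte decode with a checked one. Both function names are hypothetical and only cover the 2-byte case; they are not the macros mentioned above.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical "unsafe" decode of a 2-byte sequence: trusts the input. */
int32_t utf8_get2_unsafe(const uint8_t *s)
{
    return ((s[0] & 0x1F) << 6) | (s[1] & 0x3F);
}

/* Hypothetical checked variant: returns -1 for illegal or irregular input. */
int32_t utf8_get2_safe(const uint8_t *s, size_t len)
{
    int32_t c;

    if (len < 2 || (s[0] & 0xE0) != 0xC0 || (s[1] & 0xC0) != 0x80) {
        return -1;              /* not a 2-byte sequence at all */
    }
    c = utf8_get2_unsafe(s);
    return c >= 0x80 ? c : -1;  /* below 0x80 means a non-shortest form */
}
```

The unchecked version turns C0 80 into code point 0, while a byte-wise search for a 00 byte in the same buffer finds nothing. That mismatch between code unit comparison and decoded value is the problem.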

> > which makes string searching, binary comparison, and security checks
> > (see sections in modern RFCs about embedded NUL and control characters)
> > difficult when such "irregular" sequences are used.
>
> The embedded NUL and controls are all outlawed in properly formed UTF-8.
>

"properly", yes... - there is a fine line between "illegal" and "irregular" sequences, or between "impossible" and "undesired". With UTF-16, there are no irregular sequences because sequences of different lengths always produce code points of different values.

> >- the decoding is easier by having a fixed format for lead bytes instead
> >of the current variable-length format that requires a lookup table or
> >"find first 0 bit" machine operations
>
> a fixed format that takes less than 1 byte?
>

A fixed format inside the lead byte. UTF-8 uses a variable number of lead byte MSBs to indicate the length of the sequence, and it leaves a variable number of lead byte LSBs for the code point MSBs. Pretty inconvenient, although it allows a good code point range for short sequences.
One could, e.g., always use 2 bits for the length and another 3 bits for the code point MSBs. Only 21 such combinations (i.e., lead bytes) are necessary (with 6 bits per trail byte), which makes it possible to avoid the C1 controls and still have 64 trail bytes.
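
A rough C sketch of both ideas follows. The first function derives the sequence length from a UTF-8 lead byte by counting leading 1 bits, and it only accepts forms up to 4 bytes, enough for 10FFFF. The second is one possible reading of where the number 21 comes from, assuming 1 to 3 trail bytes, 3 code point bits in the lead byte, and each length starting its range at 0. Both function names and that counting rule are assumptions for illustration only.

```c
#include <stdint.h>

/* Current UTF-8: sequence length from the lead byte, found by counting
 * leading 1 bits. Returns 0 for a trail byte or an unsupported lead. */
int utf8_length_from_lead(uint8_t lead)
{
    int ones = 0;

    while (ones < 8 && (lead & (0x80 >> ones))) {
        ++ones;
    }
    if (ones == 0) {
        return 1;               /* 0xxxxxxx: single byte */
    }
    if (ones == 1 || ones > 4) {
        return 0;               /* trail byte, or longer than this sketch allows */
    }
    return ones;                /* 110xxxxx, 1110xxxx, 11110xxx */
}

/* Hypothetical fixed format: count the lead bytes needed to reach max_cp
 * when the lead byte holds 3 code point MSBs and each of the 1..3 trail
 * bytes holds 6 bits. */
int fixed_format_lead_count(int32_t max_cp)
{
    int count = 0;

    for (int trail = 1; trail <= 3; ++trail) {
        int32_t top = max_cp >> (6 * trail);  /* MSBs left for the lead byte */
        count += (top > 7 ? 7 : top) + 1;     /* 3 bits hold values 0..7 */
    }
    return count;
}
```

Under these assumptions, fixed_format_lead_count(0x10FFFF) gives 8 + 8 + 5 = 21. Together with 64 trail bytes that is 85 byte values, which fit into the 96 values A0..FF above the C1 controls.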

Please don't get me wrong: UTF-8 is a good format for what it was designed for, and it was designed before UTF-16. It was a nice feature for its time that it could cover UCS-4.
Later, ISO and Unicode decided that 2G characters were overkill, 64k too few, and 1M enough. Considering that the handling of UTF-16 needs fewer operations and is generally easier, they set the maximum code point accordingly.
If UTF-16, for whatever reason, is not suitable in some place, but UTF-8 is, then I am a happy UTF-8 user myself :-)
It is a world of compromise.

markus


