Re: Rationale for U+10FFFF?

From: Harald Tveit Alvestrand (Harald@Alvestrand.no)
Date: Mon Mar 06 2000 - 17:04:07 EST


At 10:35 06.03.00 -0800, Markus Scherer wrote:

>UTF-16, therefore, does not need any range checks - either you have a BMP
>code point or put it together from a surrogate pair.

except when the second code unit isn't a trailing surrogate, in which case
you have an erroneously encoded string. A range check is required.
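As a sketch of the point (Python, not from the thread; the function name is made up): decoding UTF-16 from 16-bit code units needs exactly this check - a lead surrogate must be followed by a trail surrogate, and an unpaired surrogate is an error.

```python
def decode_utf16_units(units):
    """Decode a sequence of 16-bit code units into code points,
    with the range checks needed to catch unpaired surrogates."""
    out = []
    i = 0
    while i < len(units):
        u = units[i]
        if 0xD800 <= u <= 0xDBFF:                      # lead (high) surrogate
            if i + 1 < len(units) and 0xDC00 <= units[i + 1] <= 0xDFFF:
                trail = units[i + 1]
                out.append(0x10000 + ((u - 0xD800) << 10) + (trail - 0xDC00))
                i += 2
            else:                                      # unpaired lead surrogate
                raise ValueError("erroneously encoded UTF-16 string")
        elif 0xDC00 <= u <= 0xDFFF:                    # unpaired trail surrogate
            raise ValueError("erroneously encoded UTF-16 string")
        else:                                          # plain BMP code point
            out.append(u)
            i += 1
    return out

# U+10400 is encoded as the surrogate pair D801 DC00
assert decode_utf16_units([0x0041, 0xD801, 0xDC00]) == [0x41, 0x10400]
```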

> UTF-8, on the other hand, comes with the problem that for a single code
> point you have (almost always) multiple encodings,

how?
with the shortest-form rule in force - that no unnecessary bytes be used for
encoding - I can't see a way to make multiple UTF-8 encodings of the same string.

There exist non-valid octet strings that a UTF-8 decoder doing no range
checks might turn into a code point without an error message, but that's
hardly a strange thing.

> which makes string searching, binary comparison, and security checks
> (see sections in modern RFCs about embedded NUL and control characters)
> difficult when such "irregular" sequences are used.

The embedded NUL and controls are all outlawed in properly formed UTF-8.
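A minimal Python sketch of why that matters (the naive decoder is hypothetical, used only for illustration): a decoder that merely recombines bits, with no shortest-form check, turns the overlong sequence C0 80 into an embedded NUL, while a conforming decoder rejects it.

```python
def naive_decode_2byte(b1, b2):
    """Hypothetical decoder: recombine a 2-byte UTF-8 sequence
    without any range or shortest-form checks."""
    return ((b1 & 0x1F) << 6) | (b2 & 0x3F)

# The overlong sequence C0 80 slips through as NUL (code point 0)
assert naive_decode_2byte(0xC0, 0x80) == 0x00

# A conforming decoder rejects it: 2-byte sequences must encode
# code points >= 0x80 (the shortest-form rule), so Python's strict
# codec raises an error here.
try:
    bytes([0xC0, 0x80]).decode("utf-8")
except UnicodeDecodeError:
    pass  # rejected, as required for properly formed UTF-8
```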

>I am actually trying to put together (for ICU) macros that do the same
>operations - get code point and increment, decrement and get code point,
>etc. - with all three UTFs, and doing it "safely" with all the error
>checking is quite a pain with UTF-8. I had to move most of it into
>functions because the macros became too large. Doing another check for the
>code point <=0x10ffff does not cause any significant performance
>degradation here.
>
>It is also widely believed that a million code points are plenty. This
>makes UTF-8 unnecessarily unwieldy. With hindsight (tends to provide a
>clear view!), it would have been better to design UTF-8 such that
>
>- a code point can be encoded only in one way

done, unless I missed something

>- at most 4 bytes are used

done with 17-plane limit
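To illustrate (Python's built-in codec, not code from the thread): with the 17-plane limit the highest code point is U+10FFFF, and its UTF-8 encoding is exactly 4 bytes.

```python
# The maximum code point under the 17-plane limit, U+10FFFF,
# encodes in UTF-8 as the 4-byte sequence F4 8F BF BF.
encoded = "\U0010FFFF".encode("utf-8")
assert encoded == b"\xf4\x8f\xbf\xbf"
assert len(encoded) == 4
```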

>- only the actual range up to 0x10ffff is covered

of questionable value - see previous discussion

>- the decoding is easier by having a fixed format for lead bytes instead
>of the current variable-length format that requires a lookup table or
>"find first 0 bit" machine operations

a fixed format that takes less than 1 byte?
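For reference, a sketch of the "find first 0 bit" operation being discussed (Python, function name made up): the number of leading 1 bits in a UTF-8 lead byte gives the total sequence length, which is what requires either a lookup table or a count-leading-ones machine operation.

```python
def utf8_sequence_length(lead):
    """Return the sequence length implied by a UTF-8 lead byte,
    found by counting leading 1 bits (the 'find first 0 bit')."""
    n = 0
    while n < 8 and lead & (0x80 >> n):   # count leading 1 bits
        n += 1
    if n == 0:
        return 1                          # 0xxxxxxx: single ASCII byte
    if 2 <= n <= 4:
        return n                          # 110/1110/11110 lead bytes
    raise ValueError("not a lead byte")   # 10xxxxxx is a trail byte

assert utf8_sequence_length(0x41) == 1    # 'A'
assert utf8_sequence_length(0xC3) == 2    # e.g. lead byte of U+00E9
assert utf8_sequence_length(0xE2) == 3
assert utf8_sequence_length(0xF0) == 4
```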

>- the C1 control set is not used for multi-byte sequences (UTF-8 was
>designed to be "File-System-Safe", not "VT-100-safe"...)

that argument I agree with....

>All this is possible and easy with 64 trail bytes and 21 lead bytes.
>However, we have to live with a suboptimal UTF-8 where we need byte-based
>encodings - no one wants a new UTF for general purpose.

agreed.

              Harald

--
Harald Tveit Alvestrand, EDB Maxware, Norway
Harald.Alvestrand@edb.maxware.no



This archive was generated by hypermail 2.1.2 : Tue Jul 10 2001 - 17:20:59 EDT