RE: Rationale for U+10FFFF?

From: Murray Sargent
Date: Mon Mar 06 2000 - 17:29:34 EST

Hey guys, range checks are essentially as efficient as AND operations (in
C/C++ :-), namely

        if(IN_RANGE(n1, ch, n2))

where the IN_RANGE() macro is defined as:

#define IN_RANGE(n1, ch, n2) \
        ((unsigned)((ch) - (n1)) <= (unsigned)((n2) - (n1)))

For constant n1 and n2, the right-hand side folds to a constant, so this
requires only one extra subtraction relative to an if statement with an AND,
and it still requires only a single goto (branch). So performance-wise it's
immaterial whether you use an AND or a range check in C/C++.
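
To make the comparison concrete, here is a minimal sketch that puts the
range form next to the AND form for the surrogate block. The helper names
are made up for illustration, and it assumes unsigned is at least 32 bits.

```c
#include <assert.h>

/* One subtraction plus one unsigned compare. If ch < n1, the unsigned
   difference wraps around to a huge value and the test fails. */
#define IN_RANGE(n1, ch, n2) \
    ((unsigned)((ch) - (n1)) <= (unsigned)((n2) - (n1)))

/* Hypothetical helpers, named here for illustration only. */

/* Range form: any surrogate, U+D800..U+DFFF. */
int is_surrogate(unsigned ch)
{
    return IN_RANGE(0xD800u, ch, 0xDFFFu);
}

/* AND form of the same test. It works only because the block is
   0x800 code units long and aligned on a multiple of 0x800. */
int is_surrogate_and(unsigned ch)
{
    return (ch & 0xFFFFF800u) == 0xD800u;
}

/* The 0x10FFFF limit is not a power-of-two boundary, so a single AND
   cannot express it; the range form handles it the same way as above. */
int in_unicode_range(unsigned ch)
{
    return IN_RANGE(0u, ch, 0x10FFFFu);
}

/* Exhaustively confirm that the two surrogate tests agree. */
void check_forms_agree(void)
{
    unsigned ch;
    for (ch = 0; ch <= 0x110000u; ch++)
        assert(is_surrogate(ch) == is_surrogate_and(ch));
}
```

The range form is the more general of the two: it covers limits such as
0x10FFFF that no single mask can express.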

Now in Java, it's a bit more painful since Java doesn't have unsigned....


> -----Original Message-----
> From: Harald Tveit Alvestrand
> Sent: Monday, March 06, 2000 2:01 PM
> To: Unicode List
> Subject: Re: Rationale for U+10FFFF?
> At 10:35 06.03.00 -0800, Markus Scherer wrote:
> >UTF-16, therefore, does not need any range checks - either you have a BMP
> >code point or put it together from a surrogate pair.
> except if the second character isn't a surrogate, in which case you have
> an erroneously encoded string. Range check required.
> > UTF-8, on the other hand, comes with the problem that for a single code
> > point you have (almost always) multiple encodings,
> how?
> with the rule in force that no unnecessary bytes be used for encoding, I
> can't see a way to make multiple UTF-8 encodings of the same string.
> There exist non-valid octet strings that a UTF-8 decoder that did no
> range checks might turn into a number without an error message, but
> that's hardly a strange thing.
> > which makes string searching, binary comparison, and security checks
> > (see sections in modern RFCs about embedded NUL and control characters)
> > difficult when such "irregular" sequences are used.
> The embedded NUL and controls are all outlawed in properly formed UTF-8.
> >I am actually trying to put together (for ICU) macros that do the same
> >operations - get code point and increment, decrement and get code point,
> >etc. - with all three UTFs, and doing it "safely" with all the error
> >checking is quite a pain with UTF-8. I had to move most of it into
> >functions because the macros became too large. Doing another check for
> >the code point <=0x10ffff does not cause any significant performance
> >degradation here.
> >
> >It is also widely believed that a million code points are plenty. This
> >makes UTF-8 unnecessarily unwieldy. With hindsight (tends to provide a
> >clear view!), it would have been better to design UTF-8 such that
> >
> >- a code point can be encoded only in one way
> done, unless I missed something
> >- at most 4 bytes are used
> done with 17-plane limit
> >- only the actual range up to 0x10ffff is covered
> of questionable value - see previous discussion
> >- the decoding is easier by having a fixed format for lead bytes instead
> >of the current variable-length format that requires a lookup table or
> >"find first 0 bit" machine operations
> a fixed format that takes less than 1 byte?
> >- the C1 control set is not used for multi-byte sequences (UTF-8 was
> >designed to be "File-System-Safe", not "VT-100-safe"...)
> that argument I agree with....
> >All this is possible and easy with 64 trail bytes and 21 lead bytes.
> >However, we have to live with a suboptimal UTF-8 where we need byte-based
> >encodings - no one wants a new UTF for general purpose.
> agreed.
> Harald
> --
> Harald Tveit Alvestrand, EDB Maxware, Norway
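
On the point above about octet strings that a decoder without range checks
would silently turn into a number: a minimal sketch of a UTF-8 decoder that
rejects them with the same kind of range check. The function name and its
return convention are hypothetical (this is not the ICU macro set), and it
assumes unsigned is at least 32 bits.

```c
#include <stddef.h>

#define IN_RANGE(n1, ch, n2) \
    ((unsigned)((ch) - (n1)) <= (unsigned)((n2) - (n1)))

/* Hypothetical helper: decode one code point from s[0..len).
   Returns the number of bytes consumed (1..4), or 0 if the
   sequence is ill-formed. */
size_t utf8_decode_one(const unsigned char *s, size_t len, unsigned *out)
{
    /* Smallest code point that legitimately needs n bytes. */
    static const unsigned min_for_len[5] = { 0, 0, 0x80, 0x800, 0x10000 };
    unsigned cp;
    size_t n, i;

    if (len == 0)
        return 0;
    if (s[0] < 0x80) {                       /* ASCII, one byte */
        *out = s[0];
        return 1;
    } else if (IN_RANGE(0xC0, s[0], 0xDF)) { /* 2-byte lead */
        n = 2; cp = s[0] & 0x1Fu;
    } else if (IN_RANGE(0xE0, s[0], 0xEF)) { /* 3-byte lead */
        n = 3; cp = s[0] & 0x0Fu;
    } else if (IN_RANGE(0xF0, s[0], 0xF7)) { /* 4-byte lead */
        n = 4; cp = s[0] & 0x07u;
    } else {
        return 0;            /* stray trail byte, or a 5/6-byte lead */
    }

    if (len < n)
        return 0;            /* truncated sequence */
    for (i = 1; i < n; i++) {
        if (!IN_RANGE(0x80, s[i], 0xBF))
            return 0;        /* not a trail byte */
        cp = (cp << 6) | (s[i] & 0x3Fu);
    }

    if (cp < min_for_len[n])
        return 0;            /* non-shortest form, e.g. C0 80 for NUL */
    if (cp > 0x10FFFFu)
        return 0;            /* beyond the 17 planes */
    if (IN_RANGE(0xD800u, cp, 0xDFFFu))
        return 0;            /* surrogate code point */

    *out = cp;
    return n;
}
```

With the shortest-form test in place, each code point has exactly one
accepted encoding, so C0 80 is rejected instead of decoding to an
embedded NUL.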
