Re: UTF-8S (was: Re: ISO vs Unicode UTF-8)

From: DougEwell2@cs.com
Date: Mon Jun 04 2001 - 03:10:35 EDT


In a message dated 2001-06-03 18:04:17 Pacific Daylight Time,
Peter_Constable@sil.org writes:

> It would seem to me that there's
> another issue that has to be taken into consideration here: normalisation.
> You can't just do a simple sort using raw binary comparison; you have to
> normalise strings before you compare them, even if the comparison is a
> binary compare.

I would be surprised if that has even been considered. Normalization is one
of those fine details of Unicode, like directionality and character
properties, that may be completely unknown to a development team that thinks
the strict binary order of UTF-16 code units makes a suitable collation
order. This is the sign of a company or development team that thinks Unicode
support is simply a matter of handling 16-bit characters instead of 8-bit ones.
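As a small illustration, here is a sketch in Python (mine, not anything from
the original discussion; the variable names are made up) of why a raw binary
comparison is not enough: two canonically equivalent strings can have
different code point sequences and compare equal only after normalization.

    import unicodedata

    precomposed = "\u00E9"      # LATIN SMALL LETTER E WITH ACUTE, one code point
    decomposed  = "e\u0301"     # 'e' followed by COMBINING ACUTE ACCENT

    # A raw binary comparison says the strings differ...
    print(precomposed == decomposed)                      # False

    # ...but after normalizing both to NFC they are identical.
    print(unicodedata.normalize("NFC", precomposed) ==
          unicodedata.normalize("NFC", decomposed))       # True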

While we are at it, here is another argument against having both UTF-8 and
this new UTF-8s in parallel. Recently there was a discussion about the use of
the U+FEFF signature in UTF-8 files, with a fair number of Unicode experts
arguing against its necessity because UTF-8 is so easy to detect
heuristically. Without reopening that debate, it is worth noting that UTF-8s
could not be distinguished from UTF-8 by that technique. By definition D29,
UTF-8 already permits the encoding of unpaired surrogates, and UTF-8s must do
the same, so a byte sequence such as ED A0 80 ED B0 80 is legal in both
encodings. Read as UTF-8, it decodes to the two surrogate code points U+D800
U+DC00; read as UTF-8s, it decodes to the single supplementary code point
U+10000. The same bytes can therefore appear in either encoding with different
interpretations, so auto-detection would not work.
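Here is a sketch of that ambiguity in Python (again my own illustration; the
helper name is made up): the same six bytes yield two surrogate code points
when each 3-byte sequence is decoded on its own, as UTF-8 under D29 allows,
and yield U+10000 when the surrogate pair is combined the way UTF-8s intends.

    def decode_3byte(b):
        """Decode one 3-byte UTF-8 sequence into a 16-bit value."""
        return ((b[0] & 0x0F) << 12) | ((b[1] & 0x3F) << 6) | (b[2] & 0x3F)

    data = bytes([0xED, 0xA0, 0x80, 0xED, 0xB0, 0x80])
    units = [decode_3byte(data[i:i + 3]) for i in (0, 3)]

    # Reading 1 (UTF-8 per D29): two surrogate code points.
    print([hex(u) for u in units])          # ['0xd800', '0xdc00']

    # Reading 2 (UTF-8s): the pair combines into one supplementary code point.
    hi, lo = units
    print(hex(0x10000 + ((hi - 0xD800) << 10) + (lo - 0xDC00)))   # 0x10000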

Summary: UTF-8s is bad.

-Doug Ewell
 Fullerton, California


