L2/06-331
Date: Wed, 11 Oct 2006
Source: Mark Davis
Subject: Unicode "flaws"

===

We have received a request from John Klensin for a document that distinguishes those features of Unicode that are "mistakes" (that we would have done differently, had we known then what we know now) from those that are not. I'm of two minds as to whether this would be a useful document, but here is a strawman from an email as a basis for discussion at the meeting (or by email).

==

Like any human enterprise, Unicode has its flaws. Some of the flaws were introduced for compatibility with pre-existing encodings or then-existing technology; some were simply mistakes that we would now do differently; and some are not really flaws at all, but result from misunderstandings.

- The separation of Latin/Cyrillic/Greek but unified Han is not a mistake; we would do it the same way if we had to do it over again.

- Having the CJK compatibility characters be canonical variants instead of compatibility variants was a mistake (see the sketch at the end).

- The mathematical variants were -- in my view -- a mistake (but others differ -- this was driven in large part by the AMA). If we had to do it over again, I think a variation selector would have been better.

- The "fixed position" combining classes for some of the combining marks (such as the Hebrew points) were a mistake, and we would have done them differently.

- The other classes (Above, Above Right, ...) were not a mistake.

- UTF-16 was not a mistake. In processing, it offers many advantages over UTF-8, so for some environments it is more useful. However, for interchange UTF-8 is clearly better, both for compatibility with ASCII and in terms of average bandwidth.

- The BE vs. LE forms of UTF-16 are an interesting case. From a purely practical point of view, they are a mistake. On the other hand, it would have been extremely difficult to get the rate of early adoption that made Unicode a success without them -- and, for that matter, without the original 16-bit form.

- UTF-8 and UTF-16 have different binary sort orders (for supplementary characters). This is a mistake -- unfortunately we had not anticipated the need for supplementary characters originally; if we had, the surrogate range would have been at the top of the 16-bit range, and the two sort orders would have been identical. But by the time we realized that, that area was already taken, so D800..DFFF was the best we could do (see the sketch at the end).

- 17 planes of Unicode. A mistake -- in my view. The range 0 to 0xFFFFF (16 planes) was plenty. The math is slightly easier with 17 planes, but not enough to make up for taking "part" of a bit in the representation (see the arithmetic at the end).

- Collation != code point order; this is not a mistake. First, many languages share characters but don't sort them the same way -- this is true even within a single language. Second, collation order may change over time (cf. Spanish), or depend on the environment (German phonebook ordering). Third, code point order is not sufficient anyway, since you need multiple levels of sorting to get the right answer (see the sketch at the end).

- Combining characters. Not a mistake (addressed elsewhere).

- Normalization. If we had started with a completely blank sheet, it would not have been necessary -- we wouldn't have encoded compatibility variants, and we would always have used the NFD form. As a matter of fact, that was the very earliest design of Unicode. However, there was a huge amount of legacy data, and practicalities required us to include multiple forms, since systems didn't use more sophisticated rendering such as OpenType at the time (see the sketch at the end).

- Overloading BOM and ZWNBSP.
This was a mistake, but one that we were forced into by the 10646 & Unicode merger. We have been able to disentangle the two (slowly) by encoding a replacement, U+2060 WORD JOINER, for the word-joining function of ZWNBSP (see the sketch at the end).

- Allowing lenient interpretation (non-shortest-form UTF-8) on input while forbidding it on output. This followed the "be lenient in what you accept" principle, but in retrospect it was a mistake. We addressed this some years ago, and few implementations have the problem now (see the sketch at the end).

- Others?

See also Globalization Gotchas.
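==

The sketches below (Python 3, standard library) illustrate a few of the points above; they are minimal illustrations under stated assumptions, not normative examples.

CJK compatibility characters as canonical variants: because the compatibility ideographs were given canonical decompositions, any canonical normalization silently replaces them, so the distinction cannot survive NFC or NFD.

    # Minimal sketch: U+FA10 is a CJK compatibility ideograph with a canonical
    # (singleton) decomposition, so NFC folds it to the unified ideograph.
    import unicodedata

    compat = "\uFA10"                              # CJK COMPATIBILITY IDEOGRAPH-FA10
    print(unicodedata.decomposition(compat))       # '585A' -- canonical, no <compat> tag
    print(hex(ord(unicodedata.normalize("NFC", compat))))   # 0x585a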
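UTF-8 vs. UTF-16 binary order: the same two strings sort differently by UTF-8 bytes and by UTF-16 code units, because the surrogates (D800..DFFF) lie below E000..FFFF.

    # Minimal sketch: compare a BMP character above the surrogate range with
    # the first supplementary character under both encodings.
    bmp  = "\uFFFD"      # BMP, above the surrogate range
    supp = "\U00010000"  # first supplementary character

    def utf16_units(s):
        # big-endian UTF-16 without a BOM, viewed as a list of 16-bit units
        b = s.encode("utf-16-be")
        return [int.from_bytes(b[i:i+2], "big") for i in range(0, len(b), 2)]

    print(sorted([supp, bmp], key=lambda s: s.encode("utf-8")) == [bmp, supp])  # True: code point order
    print(sorted([supp, bmp], key=utf16_units) == [supp, bmp])                  # True: surrogates sort low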
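17 planes: the arithmetic behind "part" of a bit.

    # Minimal sketch: 17 planes reach U+10FFFF, which needs 21 bits;
    # 16 planes (0..0xFFFFF) would have fit exactly in 20 bits.
    top_17 = 17 * 0x10000 - 1
    top_16 = 16 * 0x10000 - 1
    print(hex(top_17), top_17.bit_length())   # 0x10ffff 21
    print(hex(top_16), top_16.bit_length())   # 0xfffff 20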
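Collation vs. code point order: plain code point order puts all capitals before all lowercase letters and accented letters after "z", which no dictionary does. The locale name "de_DE.UTF-8" below is an assumption; it must be installed on the system for the second sort to succeed.

    # Minimal sketch using the standard locale module for language-aware sorting.
    import locale

    words = ["Zebra", "apfel", "\u00c4pfel"]       # "Äpfel"
    print(sorted(words))                           # code point order: Zebra, apfel, Äpfel
    locale.setlocale(locale.LC_COLLATE, "de_DE.UTF-8")
    print(sorted(words, key=locale.strxfrm))       # dictionary order: Äpfel groups with apfel, before Zebra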
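Normalization: canonical equivalence (NFC/NFD) covers precomposed vs. decomposed forms, while compatibility variants only fold away under the K forms.

    # Minimal sketch of the normalization forms' basic behavior.
    import unicodedata

    decomposed  = "e\u0301"     # e + COMBINING ACUTE ACCENT
    precomposed = "\u00e9"      # é
    print(unicodedata.normalize("NFC", decomposed) == precomposed)   # True
    print(unicodedata.normalize("NFD", precomposed) == decomposed)   # True

    ligature = "\ufb01"         # LATIN SMALL LIGATURE FI, a compatibility variant
    print(unicodedata.normalize("NFC", ligature) == ligature)        # True -- unchanged
    print(unicodedata.normalize("NFKC", ligature))                   # 'fi'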
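BOM vs. ZWNBSP: U+FEFF at the start of a stream is a byte order mark, but the same code point elsewhere was ZERO WIDTH NO-BREAK SPACE; the word-joining use now belongs to U+2060 WORD JOINER.

    # Minimal sketch: the "utf-8-sig" codec treats a leading U+FEFF as a BOM
    # and strips it; the plain "utf-8" codec keeps it as a character.
    data = "\ufeffhello".encode("utf-8")
    print(data[:3])                        # b'\xef\xbb\xbf' -- the UTF-8 signature
    print(repr(data.decode("utf-8-sig")))  # 'hello'         -- BOM stripped
    print(repr(data.decode("utf-8")))      # '\ufeffhello'   -- kept as ZWNBSP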
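Non-shortest-form UTF-8: 0xC0 0xAF is an overlong encoding of "/", and a conforming decoder now rejects it rather than accepting it leniently.

    # Minimal sketch: Python's UTF-8 decoder refuses the overlong sequence.
    overlong = b"\xc0\xaf"
    try:
        overlong.decode("utf-8")
    except UnicodeDecodeError as e:
        print("rejected:", e.reason)       # e.g. "invalid start byte"
    print(b"\x2f".decode("utf-8"))         # '/' -- the shortest form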