L2/06-331
Date: Wed, 11 Oct 2006
Source: Mark Davis
Subject: Unicode "flaws"

===

We have received a request from John Klensin for a document that distinguishes those features of Unicode that are "mistakes" (that we would have done differently, had we known then what we know now) from those that are not. I'm of two minds as to whether this would be a useful document, but here is a strawman from an email as a basis for discussion at the meeting (or by email).

==

Like any human enterprise, Unicode has its flaws. Some of the flaws were introduced for compatibility with pre-existing encodings or then-existing technology; some were simply mistakes that we would now do differently; and some are not really flaws at all, but result from misunderstandings.

- The separation of Latin/Cyrillic/Greek but unified Han is not a mistake; we would do it the same way if we had to do it over again.

- Having the CJK compatibility characters be canonical variants instead of compatibility variants was a mistake (see the sketch at the end).

- The mathematical variants were -- in my view -- a mistake (but others differ -- this was driven in large part by the AMA). If we had to do it over again, I think a variation selector would have been better.

- The "fixed position" combining classes for some of the combining marks (such as the Hebrew points) were a mistake, and we would have done them differently.

- The other classes (Above, Above Right, ...) were not a mistake.

- UTF-16 was not a mistake. In processing, it offers many advantages over UTF-8, so for some environments it is more useful. However, for interchange UTF-8 is clearly better, both for compatibility with ASCII and in terms of average bandwidth.

- The BE vs. LE forms of UTF-16 are an interesting case. From a purely practical point of view, they are a mistake. On the other hand, it would have been extremely difficult to get the rate of early adoption that made Unicode a success without them -- and, for that matter, without the original 16-bit form.

- UTF-8 and UTF-16 have different binary sort orders (for supplementary characters). This is a mistake -- unfortunately we had not anticipated the need for supplementary characters originally; if we had, the surrogate range would have been at the top of the 16-bit range, and the two sort orders would have been identical. But by the time we realized that, that area was already taken, so D800..DFFF was the best we could do (see the sketch at the end).

- 17 planes of Unicode. A mistake -- in my view. The range 0 to 0xFFFFF (16 planes) was plenty. The math is slightly easier with 17 planes, but not enough to make up for taking "part" of a bit in the representation (see the arithmetic at the end).

- Collation != code point order; this is not a mistake. First, many languages share characters but don't sort them the same way -- this is true even within a single language. Second, collation order may change over time (cf. Spanish), or depend on the environment (German phonebook ordering). Third, code point order is not sufficient anyway, since you need multiple levels of sorting to get the right answer (see the sketch at the end).

- Combining characters. Not a mistake (addressed elsewhere).

- Normalization. If we had started with a completely blank sheet, it would not have been necessary -- we wouldn't have encoded compatibility variants, and we would always have used the NFD form. As a matter of fact, that was the very earliest design of Unicode. However, there was a huge amount of legacy data, and practicalities required us to include multiple forms, since systems didn't use more sophisticated rendering such as OpenType at the time (see the sketch at the end).

- Overloading BOM and ZWNBSP.
This was a mistake, but one that we were forced into by the 10646 & Unicode merger. We have been able to disentangle the two (slowly) by encoding a replacement, U+2060 WORD JOINER, for the word-joining function of ZWNBSP (see the sketch at the end).

- Allowing lenient interpretation (non-shortest-form UTF-8) on input while forbidding it on output. This followed the "be lenient in what you accept" principle, but in retrospect it was a mistake. We addressed this some years ago, and few implementations have the problem now (see the sketch at the end).

- Others?

See also Globalization Gotchas.
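==

The sketches below (Python 3, standard library) illustrate a few of the points above; they are minimal illustrations under stated assumptions, not normative examples.

CJK compatibility characters as canonical variants: because the compatibility ideographs were given canonical decompositions, any canonical normalization silently replaces them, so the distinction cannot survive NFC or NFD.

    # Minimal sketch: U+FA10 is a CJK compatibility ideograph with a canonical
    # (singleton) decomposition, so NFC folds it to the unified ideograph.
    import unicodedata

    compat = "\uFA10"                              # CJK COMPATIBILITY IDEOGRAPH-FA10
    print(unicodedata.decomposition(compat))       # '585A' -- canonical, no <compat> tag
    print(hex(ord(unicodedata.normalize("NFC", compat))))   # 0x585a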
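UTF-8 vs. UTF-16 binary order: the same two strings sort differently by UTF-8 bytes and by UTF-16 code units, because the surrogates (D800..DFFF) lie below E000..FFFF.

    # Minimal sketch: compare a BMP character above the surrogate range with
    # the first supplementary character under both encodings.
    bmp  = "\uFFFD"      # BMP, above the surrogate range
    supp = "\U00010000"  # first supplementary character

    def utf16_units(s):
        # big-endian UTF-16 without a BOM, viewed as a list of 16-bit units
        b = s.encode("utf-16-be")
        return [int.from_bytes(b[i:i+2], "big") for i in range(0, len(b), 2)]

    print(sorted([supp, bmp], key=lambda s: s.encode("utf-8")) == [bmp, supp])  # True: code point order
    print(sorted([supp, bmp], key=utf16_units) == [supp, bmp])                  # True: surrogates sort low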
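17 planes: the arithmetic behind "part" of a bit.

    # Minimal sketch: 17 planes reach U+10FFFF, which needs 21 bits;
    # 16 planes (0..0xFFFFF) would have fit exactly in 20 bits.
    top_17 = 17 * 0x10000 - 1
    top_16 = 16 * 0x10000 - 1
    print(hex(top_17), top_17.bit_length())   # 0x10ffff 21
    print(hex(top_16), top_16.bit_length())   # 0xfffff 20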
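Collation vs. code point order: plain code point order puts all capitals before all lowercase letters and accented letters after "z", which no dictionary does. The locale name "de_DE.UTF-8" below is an assumption; it must be installed on the system for the second sort to succeed.

    # Minimal sketch using the standard locale module for language-aware sorting.
    import locale

    words = ["Zebra", "apfel", "\u00c4pfel"]       # "Äpfel"
    print(sorted(words))                           # code point order: Zebra, apfel, Äpfel
    locale.setlocale(locale.LC_COLLATE, "de_DE.UTF-8")
    print(sorted(words, key=locale.strxfrm))       # dictionary order: Äpfel groups with apfel, before Zebra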
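Normalization: canonical equivalence (NFC/NFD) covers precomposed vs. decomposed forms, while compatibility variants only fold away under the K forms.

    # Minimal sketch of the normalization forms' basic behavior.
    import unicodedata

    decomposed  = "e\u0301"     # e + COMBINING ACUTE ACCENT
    precomposed = "\u00e9"      # é
    print(unicodedata.normalize("NFC", decomposed) == precomposed)   # True
    print(unicodedata.normalize("NFD", precomposed) == decomposed)   # True

    ligature = "\ufb01"         # LATIN SMALL LIGATURE FI, a compatibility variant
    print(unicodedata.normalize("NFC", ligature) == ligature)        # True -- unchanged
    print(unicodedata.normalize("NFKC", ligature))                   # 'fi'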
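BOM vs. ZWNBSP: U+FEFF at the start of a stream is a byte order mark, but the same code point elsewhere was ZERO WIDTH NO-BREAK SPACE; the word-joining use now belongs to U+2060 WORD JOINER.

    # Minimal sketch: the "utf-8-sig" codec treats a leading U+FEFF as a BOM
    # and strips it; the plain "utf-8" codec keeps it as a character.
    data = "\ufeffhello".encode("utf-8")
    print(data[:3])                        # b'\xef\xbb\xbf' -- the UTF-8 signature
    print(repr(data.decode("utf-8-sig")))  # 'hello'         -- BOM stripped
    print(repr(data.decode("utf-8")))      # '\ufeffhello'   -- kept as ZWNBSP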
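Non-shortest-form UTF-8: 0xC0 0xAF is an overlong encoding of "/", and a conforming decoder now rejects it rather than accepting it leniently.

    # Minimal sketch: Python's UTF-8 decoder refuses the overlong sequence.
    overlong = b"\xc0\xaf"
    try:
        overlong.decode("utf-8")
    except UnicodeDecodeError as e:
        print("rejected:", e.reason)       # e.g. "invalid start byte"
    print(b"\x2f".decode("utf-8"))         # '/' -- the shortest form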