Re: Nicest UTF

From: Andrew C. West (andrewcwest@alumni.princeton.edu)
Date: Fri Dec 03 2004 - 04:54:58 CST


    On Thu, 2 Dec 2004 21:56:28 -0800, "Doug Ewell" wrote:
    >
    > This thread amuses me.
    >

    Me too, but then most threads on this list do ;)

    >
    > I also think that as more and more Han characters are encoded in the
    > supplementary space, corresponding to the ever-growing repertoires of
    > Eastern standards, the story that UTF-16 is virtually a fixed-width
    > encoding because "supplementary code points are very rare in most text"
    > will gradually go away.
    >

    More and more mostly very obscure and rarely used Han ideographs. It does not
    matter how many tens of thousands of additional CJK ideographs you add to the
    supplementary planes, the vast majority of CJK users will still get by quite
    happily with only CJK and CJK-A, which, as they are inherited from the important
    legacy CJK encoding standards, are what most CJK users have been living with for
    many years now. Of course people on this list, such as Richard Cook and myself,
    find endless use for obscure and archaic ideographs, but in writing day-to-day
    Chinese/Japanese/Korean there is no need to resort to CJK-B or CJK-C, except for
    certain idiosyncratic (U+24B62 CEI4 is my personal favourite) or dialectal
    usages, which are not typical.
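
    To make the fixed-width point concrete, here is a minimal sketch of how a
    CJK-B character such as U+24B62 turns into two UTF-16 code units while a BMP
    ideograph needs only one. The function name utf16_units is made up for
    illustration; the arithmetic is the standard surrogate-pair calculation.

```python
def utf16_units(ch: str) -> list[int]:
    """Return the UTF-16 code units for a single code point.

    Illustrative helper only; it does not validate lone surrogates.
    """
    cp = ord(ch)
    if cp < 0x10000:
        # BMP code point: one 16-bit code unit, same value as the code point.
        return [cp]
    # Supplementary code point: subtract 0x10000, then split the remaining
    # 20 bits into a high (lead) and a low (trail) surrogate.
    cp -= 0x10000
    return [0xD800 + (cp >> 10), 0xDC00 + (cp & 0x3FF)]


bmp = utf16_units("\u4e2d")            # a common BMP ideograph
cjk_b = utf16_units("\U00024B62")      # the CJK-B character mentioned above

print([f"{u:04X}" for u in bmp])       # one code unit
print([f"{u:04X}" for u in cjk_b])     # two code units (a surrogate pair)
```

    Code that assumes one code unit per character works for the first case and
    silently miscounts or splits the second.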

    Now that the number of allocated characters in planes 1, 2 and 14 (45,718
    characters) is only a little smaller than the number of allocated characters in
    the BMP (57,129) (and soon it will be greater), it is of course ridiculous to
    claim that
    Unicode is basically a standard for 16-bit characters, but despite the large
    number of supra-BMP characters they are, by definition, rarely used, and IMHO it
    will remain true that "supplementary code points are very rare in most text".
    That is not to say that I think that it is OK for people to be lazy, and just
    ignore everything outside the BMP. I strongly agree that all Unicode
    implementations should cover all of Unicode, and not just the BMP, and it really
    annoys me when they don't; but suggesting that you need to implement supra-BMP
    characters because they are going to start popping up all over the place is
    wrong in my opinion (not that Doug suggested that, but that's my extrapolation
    of his point). Software developers need to implement supra-BMP characters
    because some users (probably very few) will from time to time want to use them,
    and software should allow people to do what they want.
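
    As a rough illustration of both halves of that argument, the sketch below
    counts how many code points in a string lie outside the BMP and how many
    UTF-16 code units the string occupies. The function name and the sample
    string are hypothetical; any real text could be measured the same way.

```python
def supplementary_stats(text: str) -> tuple[int, int, int]:
    """Return (code points, supplementary code points, UTF-16 code units).

    Assumes `text` is a normal Python str, i.e. a sequence of code points.
    """
    code_points = len(text)
    # Anything above U+FFFF lives in the supplementary planes.
    supplementary = sum(1 for ch in text if ord(ch) > 0xFFFF)
    # Each supplementary code point costs one extra UTF-16 code unit.
    units = code_points + supplementary
    return code_points, supplementary, units


sample = "\u4e2d\u6587\U00024B62"      # two BMP ideographs plus one from CJK-B
total, supp, units = supplementary_stats(sample)
print(total, supp, units)
```

    In everyday text the second number will usually be zero, which is the
    "very rare" part; the third number differing from the first is why
    software still has to handle the case correctly when it does occur.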

    Andrew


