Re: Nicest UTF

From: Doug Ewell (dewell@adelphia.net)
Date: Thu Dec 02 2004 - 23:56:28 CST

    This thread amuses me.

    I feel like I know quite a bit about the various Unicode encoding forms
    and schemes, and my personal opinion is that UTF-16 combines the worst
    of UTF-8 (necessity to support multi-code unit characters, regardless of
    how "rare") with the worst of UTF-32 (high overhead for many scripts).
    Yet there is a Technical Note, UTN #12, that encourages users to use
    UTF-16 for internal processing, for exactly the opposite reasons.
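
    To make the trade-off concrete, here is a minimal Python sketch that
    counts how many code units each encoding form needs for a given string.
    The helper name `code_units` and the sample characters are assumptions
    chosen for illustration; they do not come from UTN #12 or any library.

```python
def code_units(text):
    """Return the number of code units `text` needs in each encoding form.

    Hypothetical helper for illustration: it encodes without a BOM and
    divides the byte length by the code unit width.
    """
    return {
        "UTF-8": len(text.encode("utf-8")),            # 8-bit code units
        "UTF-16": len(text.encode("utf-16-le")) // 2,  # 16-bit code units
        "UTF-32": len(text.encode("utf-32-le")) // 4,  # 32-bit code units
    }


def byte_size(text):
    """Return the encoded size in bytes for each encoding form (no BOM)."""
    return {
        "UTF-8": len(text.encode("utf-8")),
        "UTF-16": len(text.encode("utf-16-le")),
        "UTF-32": len(text.encode("utf-32-le")),
    }


if __name__ == "__main__":
    samples = {
        "Latin A (U+0041)": "A",
        "Han (U+6F22)": "\u6f22",
        "Supplementary Han (U+20000)": "\U00020000",
    }
    for label, text in samples.items():
        print(label, code_units(text), byte_size(text))
```

    For the supplementary character, UTF-16 needs two code units, so code
    that handles UTF-16 still has to cope with multi-unit characters, while
    plain ASCII text still takes twice the bytes it would in UTF-8.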

    So I think the word "nice" is actually quite appropriate for this
    thread. It implies a personal aesthetic judgment, which is what is
    really being discussed here.

    I use UTF-8 for most interchange (such as this message; OE doesn't allow
    me to send UTF-16) and UTF-32 for most internal processing that I write
    myself. Let people say UTF-32 is wasteful if they want; I don't tend to
    store huge amounts of text in memory at once, so the overhead matters
    much less to me than having one code unit per character.

    I do wish the following statements would stop coming up every time this
    subject is debated:

    (1) UTF-32 doesn't really guarantee one code unit per character, since
    you still have to worry about combining sequences.
    (2) Write functions that deal with strings, not characters, and the
    difference becomes moot.

    Both statements (which are really variations on the same theme) miss the
    point somewhat. Combining sequences and other interactions between
    encoded characters don't change the fact that sometimes you have to deal
    with strings, and sometimes you have to deal with individual characters.
    That's simply a fact. Both types of processing are important.
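
    A short Python sketch of the two kinds of processing, using only the
    standard `unicodedata` module. The function names and the sample string
    are assumptions for illustration. A combining sequence occupies more
    than one UTF-32 code unit, yet some work is still done on each encoded
    character and other work on the string as a whole.

```python
import unicodedata


def utf32_unit_count(text):
    """Number of UTF-32 code units, which equals the number of code points."""
    return len(text.encode("utf-32-le")) // 4


def combining_classes(text):
    """Character-level processing: canonical combining class per code point."""
    return [unicodedata.combining(ch) for ch in text]


def compose(text):
    """String-level processing: NFC normalization of the whole string."""
    return unicodedata.normalize("NFC", text)


if __name__ == "__main__":
    sequence = "e\u0301"  # LATIN SMALL LETTER E + COMBINING ACUTE ACCENT
    print(utf32_unit_count(sequence))   # two code units for one visible letter
    print(combining_classes(sequence))  # per-character property lookup
    print(compose(sequence) == "\u00e9")
```

    The property lookup needs individual characters, and the normalization
    needs the whole string. Neither replaces the other.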

    I also think that as more and more Han characters are encoded in the
    supplementary space, corresponding to the ever-growing repertoires of
    Eastern standards, the story that UTF-16 is virtually a fixed-width
    encoding because "supplementary code points are very rare in most text"
    will gradually go away.
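
    As a rough illustration of what "not fixed-width" means in practice,
    the Python sketch below lists the 16-bit code units of a string and
    flags supplementary code points. The helper names are hypothetical, and
    U+20000 is just one example of a Han character outside the BMP.

```python
def utf16_code_units(text):
    """Return the UTF-16 code units of `text` as a list of integers."""
    data = text.encode("utf-16-le")  # little-endian, no BOM
    return [int.from_bytes(data[i:i + 2], "little")
            for i in range(0, len(data), 2)]


def has_supplementary(text):
    """True if any code point lies outside the Basic Multilingual Plane."""
    return any(ord(ch) > 0xFFFF for ch in text)


if __name__ == "__main__":
    bmp_han = "\u6f22"          # one UTF-16 code unit
    supp_han = "\U00020000"     # encoded as a surrogate pair
    print([hex(u) for u in utf16_code_units(bmp_han)])
    print([hex(u) for u in utf16_code_units(supp_han)])
    print(has_supplementary(bmp_han + supp_han))
```

    Code that indexes UTF-16 text by code unit will land in the middle of a
    surrogate pair for strings like the second one, and that case comes up
    more often as more of the text in use falls outside the BMP.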

    -Doug Ewell
     Fullerton, California
     http://users.adelphia.net/~dewell/


