Re: minimizing size (was Re: allocation of Georgian letters)

From: William J Poser (wjposer@ldc.upenn.edu)
Date: Thu Feb 07 2008 - 00:30:29 CST


    >In a world where the "next million users" are making less than $2 a day and
    >are unlikely to be buying a computer anytime soon, and the majority of
    >cellular phones available will not support anything needing more than one
    >byte for most letters, I'd say that the "obsession" with size is not an
    >entirely outdated obsession....

    The existence of devices that only support single-byte encodings has
    no bearing on whether a character should be placed in a range that
    requires two bytes, three bytes, or four bytes in UTF-8. The UTF-8
    representation is unusable on such devices no matter what. The only
    way to deal with such devices is to use a single byte encoding for
    the relevant characters, that is, not Unicode.

    The existence of such devices does bear on which characters should be
    in the single-byte UTF-8 range, but since nearly everything necessarily
    lies outside of that range, there isn't much to be done about it.
    Positioning in the two byte range vs. the three byte range is
    something about which a little bit could be done, but remember that
    there are only 1,920 codepoints in the two-byte range (U+0080 through
    U+07FF), so there isn't all that much room. And how many devices
    support 1 and 2 byte codes but not 1 through 3? Does it really make a
    difference whether a writing system is in the two byte range or three
    byte range?
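
    To make the byte ranges concrete, here is a rough sketch in Python.
    The names utf8_length and RANGE_SIZES are made up for illustration;
    the boundaries are the standard UTF-8 ones, and the sample letters
    (Latin A, Greek alpha, Georgian an) are merely convenient examples.

```python
def utf8_length(code_point: int) -> int:
    """Return how many bytes UTF-8 needs for a single code point."""
    if code_point < 0x80:        # U+0000..U+007F
        return 1
    if code_point < 0x800:       # U+0080..U+07FF
        return 2
    if code_point < 0x10000:     # U+0800..U+FFFF
        return 3
    return 4                     # U+10000..U+10FFFF


# How many code points fall into each length class.
# Two bytes carry 11 payload bits (2,048 values), but the lowest 128 of
# those values are already covered by the one-byte forms.
RANGE_SIZES = {
    1: 0x80,
    2: 0x800 - 0x80,
    3: 0x10000 - 0x800 - 0x800,   # excluding the surrogates U+D800..U+DFFF
    4: 0x110000 - 0x10000,
}

samples = {
    "LATIN CAPITAL LETTER A": "A",
    "GREEK SMALL LETTER ALPHA": "\u03b1",
    "GEORGIAN LETTER AN": "\u10d0",
}

for name, ch in samples.items():
    # Cross-check the hand-written rule against Python's own encoder.
    assert utf8_length(ord(ch)) == len(ch.encode("utf-8"))
    print(f"U+{ord(ch):04X} {name}: {utf8_length(ord(ch))} byte(s)")

for length, count in RANGE_SIZES.items():
    print(f"{length}-byte class: {count} code points")
```

    The Georgian block starts at U+10A0, so its letters land in the
    three-byte class, while the whole two-byte class has room for fewer
    than two thousand characters in total.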

    I also note that my question is not only about placement within Unicode
    and how many bytes a character requires in UTF-8. It is about attempts
    to save a bit of storage more generally, such as the 24-bit encoding
    recently discussed.

    Incidentally, what is the nature of the limitation of cell phones to
    single byte encodings? Is there a technical reason for this, or is it
    merely that the manufacturers have thus far not felt much demand for
    multibyte encodings?

    >Also, when one looks at scripts side by side placed a decade ago for
    >arbitrary reasons that lead to any inconvenience on the part of those who
    >might want to use the script, it is preferable to have a better argument
    >than "just cuz" because if that were so the companies selling primarily in
    >countries that DO consider this to be an outdated notion could have
    >allocated according to putting the more emerging markets in the smaller
    >spaces and the more advanced ones in the three-byte area....

    Actually, "just cuz" is a very good argument. There are a variety of things
    that could have been done better, in hindsight. Some of them probably
    couldn't have been foreseen; some perhaps could have been. Nobody's
    perfect. But decisions had to be made, and for excellent reasons of
    stability, it isn't wise to change them too readily, so until we're
    ready to go to an incompatible ++Unicode, we're stuck with some
    arbitrary decisions, some of which may not have been optimal.

    Bill


