Re: UTF-16 inside UTF-8

From: Doug Ewell (dewell@adelphia.net)
Date: Wed Dec 03 2003 - 01:39:03 EST


    Frank Yung-Fong Tang <ytang0648 at aol dot com> wrote:

    > Agree with you. Just want to make a point that the implementation is
    > not "< 1%" of the work.

    Oh, for heaven's sake:

    If you are starting with a NON-UNICODE application -- one that has NO
    prior knowledge of UTF-anything or UCS-anything -- and you are adding
    "Unicode support" to it, the amount of work to support the entire
    17-plane Unicode range compared to just the BMP is relatively small. If
    I ever said "less than one percent," I apologize. Such a figure can
    only be determined on a case-by-case basis.
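
To make the size of the difference concrete for one component, here is a minimal sketch of a UTF-8 encoder written from scratch. The function name `encode_utf8` is hypothetical and not taken from any particular application. Under these assumptions, going from BMP-only to the full 17 planes adds one range check and one four-byte branch; other parts of an application may of course need more.

```python
def encode_utf8(cp: int) -> bytes:
    """Encode one Unicode scalar value as UTF-8 (illustrative helper)."""
    # Surrogate code points and values above U+10FFFF are not scalar values.
    if cp < 0 or cp > 0x10FFFF or 0xD800 <= cp <= 0xDFFF:
        raise ValueError(f"not a Unicode scalar value: {cp:#x}")
    if cp < 0x80:          # 1 byte: ASCII
        return bytes([cp])
    if cp < 0x800:         # 2 bytes
        return bytes([0xC0 | (cp >> 6), 0x80 | (cp & 0x3F)])
    if cp < 0x10000:       # 3 bytes: rest of the BMP
        return bytes([
            0xE0 | (cp >> 12),
            0x80 | ((cp >> 6) & 0x3F),
            0x80 | (cp & 0x3F),
        ])
    # 4 bytes: planes 1-16. This is the only branch a BMP-only encoder lacks.
    return bytes([
        0xF0 | (cp >> 18),
        0x80 | ((cp >> 12) & 0x3F),
        0x80 | ((cp >> 6) & 0x3F),
        0x80 | (cp & 0x3F),
    ])
```

The result can be checked against Python's built-in codec, for example `encode_utf8(0x1D11E) == "\U0001D11E".encode("utf-8")`.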

    >> I'll be happy to supply UTF-8 code that handles 4-byte sequences.
    >> That is not the same thing as converting an entire system from
    >> 16-bit to 32-bit integers, or adding proper UTF-16 surrogate support
    >> to a UCS-2-only system. Of course that is more work.
    >
    > Your view is based on the assumption that the internal code is UCS4
    > instead of UTF-16.

    Didn't you read what I wrote?

    > Nothing wrong if people choose to use UTF-16 instead of UCS4 in the
    > API, even as of 2003. Do you agree?

    Sure, no problem. Both UTF-16 and UCS-4 (= UTF-32) support the full
    Unicode range. Only UCS-2 does not.

    > If people do use UTF-16 in the API, it is natural for people who
    > care about the BMP but not about Planes 1-16 to work only on the BMP,
    > right? I am not saying they do the right thing. I am saying they do
    > the "natural" thing. Remember, the text describing surrogates in
    > the Unicode 4.0 standard is probably only 5-10 pages total in that
    > 1462-page standard. For developers who aren't going to implement the
    > other 1000 pages correctly, it is natural for them to think "why do I
    > need to make these 10 pages right?"

    I don't care if they choose not to provide fonts or rendering support
    for the supplementary planes. But it seems silly to deliberately
    exclude them from the underlying architecture.

    "Using UTF-16" implies that one supports the surrogate mechanism.
    UTF-16 without surrogate support is UCS-2.
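
As an illustration of what "supporting the surrogate mechanism" amounts to at the encoding level, the following is a rough sketch of the standard surrogate-pair arithmetic. The helper names `to_utf16_units` and `from_surrogates` are made up for this example.

```python
def to_utf16_units(cp: int) -> list:
    """Return the UTF-16 code units (as ints) for one Unicode scalar value."""
    if cp < 0 or cp > 0x10FFFF or 0xD800 <= cp <= 0xDFFF:
        raise ValueError(f"not a Unicode scalar value: {cp:#x}")
    if cp < 0x10000:
        # BMP: one 16-bit unit, identical to UCS-2.
        return [cp]
    # Supplementary planes: split the 20-bit offset into two 10-bit halves.
    offset = cp - 0x10000
    high = 0xD800 | (offset >> 10)     # high (lead) surrogate
    low = 0xDC00 | (offset & 0x3FF)    # low (trail) surrogate
    return [high, low]


def from_surrogates(high: int, low: int) -> int:
    """Combine a high/low surrogate pair back into a code point."""
    if not (0xD800 <= high <= 0xDBFF and 0xDC00 <= low <= 0xDFFF):
        raise ValueError("not a valid surrogate pair")
    return 0x10000 + ((high - 0xD800) << 10) + (low - 0xDC00)
```

For U+1D11E, `to_utf16_units` yields the pair D834 DD1E, and `from_surrogates(0xD834, 0xDD1E)` returns 0x1D11E again. A UCS-2-only system is one that stops at the first branch.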

    Of course the Unicode Standard doesn't spend a lot of time describing
    the surrogate mechanism. It only applies to the UTF-16 character
    encoding form. The description of characters encoded in the
    supplementary planes, however, is much more extensive.

    >> I can't fight this battle with people who would rather stay with
    >> ASCII and 7/8 bits per character. They are not living in a Unicode
    >> world.
    >
    > But how about the UTF-16 vs UCS4 battle?

    Well, UTF-16 certainly does occupy less space than UTF-32 (henceforth I
    will use this term instead of "UCS-4") in memory, on disk, wherever. By
    all means, when *storing* large amounts of data, use an appropriately
    compact form. That might mean UTF-16, UTF-8, or a compression format
    such as SCSU or BOCU-1, or it might mean compressing the data using gzip
    or bzip2.
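
The storage trade-off depends heavily on the script. A small sketch using Python's built-in codecs shows this; the function name `encoded_sizes` is hypothetical, and the little-endian codec variants are used only so that no byte order mark is counted.

```python
def encoded_sizes(text: str) -> dict:
    """Return the byte length of `text` in each Unicode encoding form."""
    codecs = (("UTF-8", "utf-8"), ("UTF-16", "utf-16-le"), ("UTF-32", "utf-32-le"))
    return {name: len(text.encode(codec)) for name, codec in codecs}


print(encoded_sizes("hello"))               # ASCII: 5, 10, 20 bytes
print(encoded_sizes("\u65e5\u672c\u8a9e"))  # three CJK characters: 9, 6, 12 bytes
print(encoded_sizes("\U0001D11E"))          # one supplementary character: 4, 4, 4 bytes
```

For ASCII, UTF-8 is the most compact; for BMP text outside Latin ranges, UTF-16 can be smaller than UTF-8; for supplementary characters all three forms need four bytes.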

    When *processing* character data in memory, I would assume a fixed-width
    encoding like UTF-32 would be more convenient than a variable-width
    encoding like UTF-16. But if the extra complexity (such as it is) of
    UTF-16 is not a problem, by all means go ahead and use it.
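
To show roughly what the "extra complexity" looks like in practice, here is a sketch of fetching the n-th code point from a sequence of UTF-16 code units. In UTF-32 this would be a plain array index; in UTF-16 it needs a scan. The name `code_point_at` is an assumption for this example, and unpaired surrogates are simply returned as-is, which a real implementation might prefer to reject.

```python
def code_point_at(units: list, index: int) -> int:
    """Return the index-th code point from a list of UTF-16 code units."""
    if index < 0:
        raise IndexError("code point index out of range")
    i = 0      # position in code units
    seen = 0   # number of code points passed so far
    while i < len(units):
        u = units[i]
        is_pair = (
            0xD800 <= u <= 0xDBFF
            and i + 1 < len(units)
            and 0xDC00 <= units[i + 1] <= 0xDFFF
        )
        if is_pair:
            # Two code units form one supplementary code point.
            cp = 0x10000 + ((u - 0xD800) << 10) + (units[i + 1] - 0xDC00)
            i += 2
        else:
            cp = u
            i += 1
        if seen == index:
            return cp
        seen += 1
    raise IndexError("code point index out of range")
```

With `units = [0x61, 0xD834, 0xDD1E, 0x62]`, code point 1 is U+1D11E and code point 2 is U+0062, even though U+0062 sits at unit position 3.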

    >> I would truly be surprised if full 17-plane Unicode support in a
    >> single app could be demonstrated to be a matter of "multiple millions
    >> of dollars."
    >
    > It is not the full 17-plane Unicode support that will contribute to
    > it. It is the
    > (Number of ASCII only records X sizeof (records in UCS4)) - ( Number
    > of ASCII only records X sizeof(record in ASCII))
    >
    > that contributes to that.
    >
    > compare to
    >
    > (Number of ASCII only records X sizeof (records in UTF-8)) - ( Number
    > of ASCII only records X sizeof(record in ASCII))
    > or
    >
    > (Number of ASCII only records X sizeof (records in UTF-16)) - ( Number
    > of ASCII only records X sizeof(record in ASCII))
    >
    > The other comparison is
    > (Number of BMP only records X sizeof (records in UCS4)) - ( Number of
    > BMP only records X sizeof(record in UTF-8))
    >
    > (Number of BMP only records X sizeof (records in UCS4)) - ( Number of
    > BMP only records X sizeof(record in UTF-16))
    >
    > of course, the sizeof() is really the "average size of record with
    > those data"

    I have never suggested that people with ASCII-only data should suddenly
    quadruple their storage needs by storing it all in UTF-32. That's what
    UTF-8 and SCSU are for. In fact, their data is already in UTF-8, isn't
    it?
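
The per-record difference from the quoted formulas, sizeof(record in X) minus sizeof(record in ASCII), can be sketched as follows. The function name `storage_overhead` is hypothetical, and the little-endian codecs are chosen only to leave the byte order mark out of the count.

```python
def storage_overhead(record: bytes, codec: str) -> int:
    """Extra bytes needed to store an ASCII-only record in another encoding."""
    text = record.decode("ascii")  # raises UnicodeDecodeError if not pure ASCII
    return len(text.encode(codec)) - len(record)


record = b"hello world"
print(storage_overhead(record, "utf-8"))      # 0: ASCII bytes are already valid UTF-8
print(storage_overhead(record, "utf-16-le"))  # 11: the size doubles
print(storage_overhead(record, "utf-32-le"))  # 33: the size quadruples
```

The zero for UTF-8 is the point of the paragraph above: an ASCII-only record decodes to the same text whether it is read as ASCII or as UTF-8.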

    -Doug Ewell
     Fullerton, California
     http://users.adelphia.net/~dewell/


