Re: [unicode] UTF-c

From: mpsuzuki@hiroshima-u.ac.jp
Date: Sun Feb 20 2011 - 06:41:49 CST

    Dear Thomas,

    On Sun, 20 Feb 2011 21:47:19 +1100
    Thomas Cropley <tomcropley@gmail.com> wrote:

    >I have developed a new multi-byte character encoding for Unicode. It is
    >similar to UTF-8 but it is more efficient at encoding non-ASCII alphabetic
    >scripts. The attached UTF-c.htm file gives more details and the C++ program
    >"UTF8_c.cpp" shows how UTF-c files may be processed.

    In your proposal, the maximum length of a coded character
    is 4 octets, which is less than UTF-8's maximum length.
    It's an interesting idea.

    I guess your proposal is designed for the convenience of
    people who feel that US-ASCII compatibility is insufficient
    and that compatibility with the ISO 8859 variants is required.

    I have 2 questions:

    Q1) I guess the easiest way for people who feel that way
        is to keep using the existing ISO 8859 variants rather
        than migrating to a new encoding. Is there a large group
        of people who want to use both an ISO 8859-compatible
        ENCODING and the CHARACTERS outside it at the same
        time, and who are willing to switch their favorite
        software?

        In the Japanese market, there is a large group of
        people who want to use a legacy ENCODING (like
        Microsoft Codepage 932) together with CHARACTERS
        outside it (like those of JIS X 0213:2004), but who
        cannot afford to pay for new software or don't want
        to migrate to newer software. I'm interested in the
        situation in other countries.

    Q2) One of the advantages of the UTF-8 encoding is error
        recovery: corrupting an octet breaks the character
        that includes it, but the following characters are not
        broken. But your encoding seems to be, sorry, unsafe.
        U+0080 - U+00BF and U+0100 - U+107F are coded by similar
        2-octet sequences, so removing 1 octet may change all
        following characters. I'm afraid that won't be welcomed
        by people who switched to UTF-8 from ISO 2022 encodings
        to reduce the engineering cost of a stateful encoding.
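
        To illustrate the UTF-8 side of Q2 only, here is a minimal
        Python sketch. The helper name drop_octet and the sample
        string are my own assumptions for illustration, and it says
        nothing about how UTF-c itself behaves. It removes one octet
        from a UTF-8 stream and decodes the rest:

```python
def drop_octet(data: bytes, index: int) -> bytes:
    """Return data with the octet at index removed, simulating corruption."""
    return data[:index] + data[index + 1:]


text = "añb€c"
encoded = text.encode("utf-8")  # 61 | C3 B1 | 62 | E2 82 AC | 63

# Remove the lead octet (C3) of "ñ", leaving a stray continuation octet B1.
damaged = drop_octet(encoded, 1)

# errors="replace" substitutes U+FFFD for each undecodable sequence
# and then continues decoding from the next octet.
recovered = damaged.decode("utf-8", errors="replace")

print(recovered)  # "a\ufffdb€c": only the damaged character is lost
```

        Lead octets and continuation octets occupy disjoint value
        ranges in UTF-8, so the decoder can find the start of the
        next character without any state carried over from the
        damaged one.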

    Regards,
    mpsuzuki


