Re: [unicode] Re: UTF-c

From: Asmus Freytag (asmusf@ix.netcom.com)
Date: Tue Feb 22 2011 - 15:29:50 CST


    On 2/22/2011 12:06 PM, Doug Ewell wrote:
    > <mpsuzuki at hiroshima dash u dot ac dot jp> wrote:
    >
    >> Resynchronization on newline (or on ASCII punctuation)
    >> is needed, but I think today it is gradually becoming
    >> insufficient.
    > Again, it depends on the intended purpose of this (or any other)
    > encoding scheme. Resynchronization adds redundancy, which costs bytes.
    > If the goal is to minimize bytes, the encoding scheme has to strip away
    > as much redundancy as possible.
    >
    > Most people now suggest general-purpose compression as the "best" way to
    > compress Unicode text. Drop one byte out of a deflated or bzipped file,
    > and the resulting damage to the text will be arbitrary.
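
The point about dropping a byte from a deflated file can be tried with Python's standard zlib module. This is only a sketch: the sample text and the helper names drop_byte and try_inflate are assumptions made for illustration, not part of any scheme discussed in this thread.

```python
import zlib

def drop_byte(data: bytes, index: int) -> bytes:
    """Return a copy of data with the byte at index removed."""
    return data[:index] + data[index + 1:]

def try_inflate(blob: bytes):
    """Return the decompressed bytes, or None if zlib rejects the stream."""
    try:
        return zlib.decompress(blob)
    except zlib.error:
        return None

# Sample text is an assumption, chosen only to have something to compress.
original = ("Resynchronization adds redundancy, which costs bytes. " * 20).encode("utf-8")
packed = zlib.compress(original)

# Remove one byte from the middle of the compressed stream.
damaged = drop_byte(packed, len(packed) // 2)
recovered = try_inflate(damaged)

if recovered is None:
    print("zlib rejected the damaged stream; nothing was recovered")
else:
    print("recovered", len(recovered), "bytes; equal to original:", recovered == original)
```

Either outcome illustrates the same thing: the loss is not confined to the neighbourhood of the missing byte.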

    It's the "best" if you
    a) have relatively large amounts of text
    b) don't need any synchronization (black box)

    SCSU was created for text that
    a) comes in relatively short snippets, each of which must be compressed
    individually
    b) might need to be "patched" in the compressed state (white box)
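
A rough way to see why short snippets are a separate case is to run a general-purpose compressor over one. The sketch below uses Python's standard zlib module; the sample strings are assumptions chosen for illustration, and SCSU itself is not shown because the standard library has no codec for it.

```python
import zlib

# Sample strings are assumptions chosen for illustration.
snippet = "Привет".encode("utf-8")   # one short snippet, 12 bytes in UTF-8
document = snippet * 500             # a much larger, repetitive text

packed_snippet = zlib.compress(snippet)
packed_document = zlib.compress(document)

# The fixed header and checksum overhead dominates the short input,
# while the long input compresses well.
print(len(snippet), "->", len(packed_snippet))
print(len(document), "->", len(packed_document))
```

On the short snippet the compressed form comes out larger than the input, which is the situation where a scheme designed for individually compressed strings has its place.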

    It is always "best" to use an existing scheme, which, at this point,
    would include SCSU, unless there is some overriding critical need that
    can't be satisfied, even approximately, by anything else already defined
    and / or implemented.

    The biggest problem in the promulgation of data formats is that they
    increase the cost for everyone, because if they get implemented at all,
    sooner or later there will be data in that format that others will have
    to be able to read...

    A./

    PS: item b for SCSU might surprise some people, but SCSU is an extension
    of RCSU, which explicitly had this requirement.

    > Note that UTF-8, which has plenty of redundancy, was never represented
    > to be the smallest possible way to encode characters; it was only
    > represented not to be extravagant.
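
What that redundancy buys can be seen by dropping a byte from UTF-8 text and decoding the rest. This is a small sketch in Python; the sample string and the choice of which byte to drop are assumptions made for illustration.

```python
text = "Unicode テキスト compression"
data = text.encode("utf-8")

# "Unicode " occupies bytes 0-7, so the three-byte sequence for "テ"
# starts at index 8. Drop its second byte (index 9).
damaged = data[:9] + data[10:]

# errors="replace" substitutes U+FFFD for the broken sequence and then
# carries on decoding from the next lead byte.
decoded = damaged.decode("utf-8", errors="replace")
print(decoded)
```

Only the character whose sequence was hit is lost; the decoder picks up again at the next lead byte and the remaining text comes through intact.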
    >
    > --
    > Doug Ewell | Thornton, Colorado, USA | http://www.ewellic.org
    > RFC 5645, 4645, UTN #14 | ietf-languages @ is dot gd slash 2kf0s
