Re: Unicode & space in programming & l10n

From: Doug Ewell
Date: Sun Sep 17 2006 - 22:03:45 CDT

    Mark Davis wrote:

    > Frankly, I think the reason why SCSU and BOCU never got a lot of
    > traction is related to #1 on my list. That is, in the vast majority of
    > cases UTF-16 or UTF-8 have storage characteristics that are good
    > enough -- it's just not really worth taking extra steps to squeeze out
    > more.

    UTF-8 is practically always good enough for me, but then I'm not the one
    writing articles complaining about size "penalties" or ASCII
    compatibility. Apparently at least some people either have different
    storage needs, or haven't overcome the myths.
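
    As a rough illustration of the size question, the sketch below
    counts bytes for a few short sample strings in UTF-8 and UTF-16.
    The sample words and the helper name encoded_sizes are assumptions
    chosen for this example; they do not come from the thread.

```python
# Compare the storage cost of UTF-8 and UTF-16 for a few short samples.
# The sample strings are illustrative only.
samples = {
    "English": "hello",
    "Russian": "привет",
    "Hindi": "नमस्ते",
}


def encoded_sizes(text):
    """Return (code points, UTF-8 bytes, UTF-16 bytes) for text."""
    # "utf-16-le" is used so that no byte order mark is counted.
    return len(text), len(text.encode("utf-8")), len(text.encode("utf-16-le"))


for name, text in samples.items():
    code_points, utf8_bytes, utf16_bytes = encoded_sizes(text)
    print(f"{name}: {code_points} code points, "
          f"UTF-8 {utf8_bytes} bytes, UTF-16 {utf16_bytes} bytes")
```

    For these samples UTF-8 is smaller for ASCII, ties with UTF-16 for
    Cyrillic, and is half again as large for Devanagari, which is the
    size "penalty" usually being complained about.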

    > The only small-string compression scheme to gain fairly wide
    > acceptance, for different reasons, is PunyCode.

    I'm actually quite impressed with how elegantly and efficiently Punycode
    encodes IDNs under the numerous constraints that that implies. But if I
    remember correctly, it's not suitable for arbitrary text, such as this
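
    A minimal sketch of what Punycode does to a single label, using the
    "punycode" and "idna" codecs that ship with Python. The helper names
    to_punycode and from_punycode and the sample label are assumptions
    made for this example.

```python
def to_punycode(label):
    """Encode one label with the raw Punycode (RFC 3492) codec."""
    return label.encode("punycode").decode("ascii")


def from_punycode(encoded):
    """Decode a raw Punycode string back to the original label."""
    return encoded.encode("ascii").decode("punycode")


label = "bücher"
encoded = to_punycode(label)

# The ASCII letters are copied out first; the non-ASCII letter and its
# position are packed into the suffix after the last hyphen.
print(encoded)                          # bcher-kva
print(from_punycode(encoded) == label)  # True

# The idna codec applies the same encoding per domain label and adds
# the "xn--" prefix.
print("bücher.example".encode("idna"))  # b'xn--bcher-kva.example'
```

    The output is pure ASCII and compact for a short label, but the
    non-ASCII letters are pulled out of their original positions, so the
    encoded form cannot be read or edited in place.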

    > Of course, ZIP and related compressions do a pretty good job on any of
    > these languages encoding in Unicode, so they can be applied to reduce
    > sizes for any and all of them, in appropriate circumstances.

    The usual problem with general-purpose compression is that the output is
    no longer "text," but some sort of compressed blob that must be
    explicitly operated upon before it is usable as text. SCSU or BOCU-1
    text can be interpreted directly, without passing it through a separate
    decompressor, and I can even open and save SCSU-encoded text files
    directly in SC UniPad (thanks to the encoder and decoder I gave them
    years ago :).
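
    The "compressed blob" point can be sketched with Python's zlib
    module, standing in here for ZIP-style compression in general. The
    sample text and the helper name is_valid_utf8 are assumptions made
    for this example.

```python
import zlib

# Repetitive sample text, so that the compressor has something to work on.
text = "Unicode text compresses well, but only as an opaque blob. " * 20
raw = text.encode("utf-8")
blob = zlib.compress(raw)


def is_valid_utf8(data):
    """Return True if data can be decoded as UTF-8 text."""
    try:
        data.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


print(len(raw), len(blob))   # the blob is much smaller than the raw bytes
print(is_valid_utf8(raw))    # True: still usable as text
print(is_valid_utf8(blob))   # False: no longer text

# The text comes back only after an explicit decompression step.
print(zlib.decompress(blob).decode("utf-8") == text)  # True
```

    Nothing here measures SCSU or BOCU-1; the sketch shows only the
    extra decompression step that general-purpose compression puts
    between the stored bytes and usable text.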

    Doug Ewell
    Fullerton, California, USA
    RFC 4645  *  UTN #14
