Re: Nicest UTF

From: Doug Ewell (
Date: Fri Dec 03 2004 - 00:21:22 CST


    Philippe Verdy <verdy underscore p at wanadoo dot fr> wrote:

    > All UTF encodings (including the SCSU compressed encoding, or BOCU-8
    > which is a variant of UTF-8, or also now the GB18030 Chinese standard
    > which is now a valid representation of Unicode) have their pros and
    > cons.

    UTFs are, by definition, stateless and have exactly one valid
    representation for each code point. So SCSU, much as I like it, is not
    a UTF.
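
    To illustrate the "exactly one valid representation" half of that
    definition, here is a minimal Python sketch using UTF-8. The helper
    name is_valid_utf8 and the two constants are made up for this example;
    it relies only on Python's built-in UTF-8 codec, which rejects
    non-shortest ("overlong") forms.

```python
def is_valid_utf8(data: bytes) -> bool:
    """Return True if data decodes as well-formed UTF-8 (hypothetical helper)."""
    try:
        data.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


# U+002F SOLIDUS has exactly one valid UTF-8 form: the single byte 0x2F.
SLASH = b"\x2f"

# The same code point padded into two bytes is an "overlong" form.
# A conforming UTF-8 decoder must reject it rather than accept a second spelling.
OVERLONG_SLASH = b"\xc0\xaf"

print(is_valid_utf8(SLASH))           # True
print(is_valid_utf8(OVERLONG_SLASH))  # False
```

    SCSU, by contrast, lets an encoder choose among several byte sequences
    for the same text, which is why it falls outside the definition.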

    BOCU-1 is also not a UTF, and in particular there is no conceivable way
    it can be regarded as "a variant of UTF-8." I have no idea what
    "BOCU-8" is. Maybe that one really is a variant of UTF-8.

    Though not promulgated by Unicode, GB18030 can be considered a UTF,
    since it is really just a mapping from Unicode code points to sequences
    of 1, 2, or 4 bytes.
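
    The 1-, 2-, and 4-byte lengths can be checked with a short Python
    sketch, assuming the "gb18030" codec that ships with CPython. The
    function name gb18030_length and the sample characters are my own
    choices for illustration.

```python
def gb18030_length(ch: str) -> int:
    """Number of bytes the gb18030 codec uses for one character."""
    return len(ch.encode("gb18030"))


# Sample code points chosen to hit each sequence length.
SAMPLES = [
    "A",           # U+0041, ASCII: 1 byte
    "\u4e2d",      # U+4E2D, a common CJK ideograph: 2 bytes
    "\u0080",      # U+0080, no 1- or 2-byte mapping: 4 bytes
    "\U00010000",  # U+10000, first supplementary code point: 4 bytes
]

lengths = {ch: gb18030_length(ch) for ch in SAMPLES}

for ch, n in lengths.items():
    print(f"U+{ord(ch):04X} -> {n} byte(s)")

# The mapping is reversible: decoding returns the original code points.
round_trip = "".join(SAMPLES).encode("gb18030").decode("gb18030")
```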


    > SCSU is excellent for immutable strings, and is a *very* tiny overhead
    > above ISO-8859-1 (note that the conversion from ISO-8859-1 to SCSU is
    > extremely trivial, may be even simpler than to UTF-8!)

    An ISO 8859-1 string that contains no controls except NUL, CR, LF, and
    Tab is *already* in SCSU. No conversion needed.
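
    That condition is easy to test. The sketch below is only an
    illustration: the function name is invented, and it assumes, per my
    reading of the SCSU specification (UTS #6), that in the initial
    single-byte mode every C0 byte other than NUL, Tab, LF, and CR is a
    tag, while 0x20-0x7F and 0x80-0xFF pass through as U+0020-U+007F and
    U+0080-U+00FF.

```python
# C0 bytes that SCSU single-byte mode passes through unchanged:
# NUL, Tab, LF, CR. (Assumption: all other bytes below 0x20 are SCSU tags.)
PASS_THROUGH_CONTROLS = frozenset({0x00, 0x09, 0x0A, 0x0D})


def latin1_is_already_scsu(data: bytes) -> bool:
    """True if these ISO 8859-1 bytes can be read as SCSU without change."""
    return all(b >= 0x20 or b in PASS_THROUGH_CONTROLS for b in data)


plain = "caf\u00e9 au lait\r\n".encode("latin-1")
with_escape = b"\x1b[0mhello"  # contains ESC (0x1B), which is an SCSU tag byte

print(latin1_is_already_scsu(plain))        # True
print(latin1_is_already_scsu(with_escape))  # False
```

    A string that fails this check would need the offending controls
    quoted before it could be treated as SCSU.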

    I appreciate Philippe's support of SCSU, but I don't think *even I*
    would recommend it as an internal storage format. The effort to encode
    and decode it, while by no means as Herculean as is often perceived, is
    not trivial once you step outside Latin-1.

    -Doug Ewell
     Fullerton, California
