About UTS#6: SCSU - 10. Possible Private Extensions: why not a "COBS" TES?

From: Philippe Verdy (verdy_p@wanadoo.fr)
Date: Sat Aug 14 2004 - 08:21:08 CDT

    About UTS#6: SCSU (A Standard Compression Scheme for Unicode).
    http://www.unicode.org/reports/tr6/tr6-3.5.html

    I know that this is not part of the SCSU standard, but section 10, on
    private extensions of SCSU, seems to overlook some other well-known
    transfer encoding syntaxes that allow SCSU content to be transported within
    streams where the use of control bytes (like the null byte) is restricted.

    One well-known method is to apply a "COBS" encoding.
    See reference and implementation details in
    http://www.acm.org/sigcomm/sigcomm97/papers/p062.pdf

    It is MUCH better than the proposed method in section 10.1 that uses "DLE
    escaping", and the method is generic enough to allow escaping ANY byte value
    (not only the 0x00 byte):

       (1) When used with the default profile (which just avoids the null byte
    value), COBS avoids any occurrence of the null byte, with the worst case
    adding no more than 1 byte for every 254 source bytes, and no more than
    1 additional byte for any random source stream.

       (2) With an extended COBS profile, where N byte values need to be avoided
    in the encoded stream, the worst case produces only 1 additional byte for
    every (255-N) source bytes, and also no more than 1 additional byte for any
    random source stream. So this can be used to restrict the output stream to
    avoid ALL control bytes that are undesirable during transport, notably all
    C0 control bytes used by SCSU as "tags" (i.e. bytes 0x00-0x1F except
    CR=0x0D, LF=0x0A, TAB=0x09), or even all C1 control bytes (in 0x80-0x9F,
    notably the NEL character, 0x85).

       (3) A COBS profile that avoids all C0 control bytes except CR, LF and TAB
    (N=29) would cost no more than 1 additional byte for every 226
    SCSU-encoded source bytes, and one that also avoids all C1 control bytes
    (N=61) no more than 1 for every 194: this worst case represents about +0.5%
    of the transported data size, still much better than the +100% you get in
    the worst case with the transport syntaxes suggested in 10.1!

       (4) COBS can also be used to restrict the allowed bytes to the 7-bit
    range, making SCSU plus a COBS transfer encoding syntax in this profile
    suitable for email, and still much better than UTF-7 for Asian languages or
    multilingual documents, which benefit greatly from SCSU compression.
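
    The worst-case figures in (1) to (3) can be recomputed with a few lines of
    Python. This is only a sketch of the arithmetic: it takes the "1 additional
    byte per (255-N) source bytes" rule from (2) as given, and the function and
    profile names are made up for the illustration.

```python
import math

def worst_case_overhead(n_forbidden: int, source_len: int) -> int:
    """Worst-case extra bytes, assuming 1 extra byte per (255 - N) source bytes."""
    block = 255 - n_forbidden
    return math.ceil(source_len / block)

# Counts of byte values to avoid (assumed profiles, named here for illustration).
C0_EXCEPT_CR_LF_TAB = 32 - 3   # 0x00-0x1F minus TAB (0x09), LF (0x0A), CR (0x0D)
C1 = 32                        # 0x80-0x9F

profiles = {
    "null byte only": 1,
    "C0 except CR/LF/TAB": C0_EXCEPT_CR_LF_TAB,
    "C0 except CR/LF/TAB, plus C1": C0_EXCEPT_CR_LF_TAB + C1,
}

for name, n in profiles.items():
    block = 255 - n
    print(f"{name}: N={n}, 1 extra byte per {block} bytes ({100 / block:.2f}%)")
```

    Under that rule, a block size of 226 corresponds to N=29 (the C0 controls
    minus CR, LF and TAB), while adding the 32 C1 controls gives N=61 and a
    block size of 194.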

    A COBS profile can also handle the case of repeated byte values in the
    SCSU compressed stream (case discussed in section 10.2 of UTS#6).

    It also works much better than other well-known Transfer Encoding Syntaxes
    like Base64 or Quoted-Printable, which are often used for email but behave
    poorly with Asian languages: these TES also have very poor worst cases (that
    can completely cancel the compression benefits offered by SCSU).
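
    To put rough numbers on this comparison, here is a small Python sketch
    using the standard base64 and quopri modules. The payload is a synthetic
    stand-in made only of bytes >= 0x80, not real SCSU output, and the COBS
    figure is the worst-case bound for the default (null-avoiding) profile
    rather than a measured size.

```python
import base64
import math
import quopri

# Hypothetical stand-in for compressed data: 1000 bytes in 0x80-0xDF.
payload = bytes(0x80 + (i % 0x60) for i in range(1000))

b64 = base64.b64encode(payload)      # 4 output bytes per 3 input bytes
qp = quopri.encodestring(payload)    # each byte >= 0x80 becomes "=XX"

# Worst-case size for default-profile COBS: 1 extra byte per 254 source bytes.
cobs_worst = len(payload) + math.ceil(len(payload) / 254)

print("source           :", len(payload))
print("Base64           :", len(b64))
print("Quoted-Printable :", len(qp))
print("COBS (worst case):", cobs_worst)
```

    On such input Base64 grows the data by about a third and Quoted-Printable
    at least triples it, while the COBS bound adds only 4 bytes to the 1000.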

    Implementing COBS is also very straightforward, with very little CPU
    overhead (with the default profile that avoids null byte values, COBS just
    needs an internal buffer of at most 254 bytes, which is very reasonable
    and easy to implement in low-cost hardware too).
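
    As an illustration, here is a minimal Python sketch of the default profile
    (null byte avoided), written from the general description of COBS rather
    than copied from the referenced paper. The function names are mine, and
    the input can be any byte string, such as the output of an SCSU
    compressor.

```python
def cobs_encode(data: bytes) -> bytes:
    """Encode data so that the result contains no 0x00 byte."""
    out = bytearray()
    block = bytearray()   # pending non-zero bytes, never more than 254
    pending = True        # True while a closing code byte is still owed
    for b in data:
        if b == 0:
            # A zero ends the block: emit its length code, then the block.
            out.append(len(block) + 1)
            out += block
            block.clear()
            pending = True
        else:
            block.append(b)
            pending = True
            if len(block) == 254:
                # Full block: code 0xFF means "254 bytes, no implied zero".
                out.append(0xFF)
                out += block
                block.clear()
                pending = False
    if pending:
        out.append(len(block) + 1)
        out += block
    return bytes(out)


def cobs_decode(enc: bytes) -> bytes:
    """Reverse cobs_encode; raises ValueError on malformed input."""
    out = bytearray()
    i = 0
    while i < len(enc):
        code = enc[i]
        if code == 0:
            raise ValueError("zero byte in COBS data")
        chunk = enc[i + 1 : i + code]
        if len(chunk) != code - 1:
            raise ValueError("truncated COBS block")
        out += chunk
        i += code
        # Every block except a full one (0xFF) or the last one implies a zero.
        if code != 0xFF and i < len(enc):
            out.append(0)
    return bytes(out)


sample = b"\x11\x22\x00\x33"
encoded = cobs_encode(sample)
print(encoded.hex())                  # 03112202 33 -> "0311220233"
print(cobs_decode(encoded) == sample)
```

    The only state the encoder keeps is the pending block, which mirrors the
    254-byte buffer mentioned above.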

    Because of these properties, there's no need to modify the standard SCSU
    algorithm: one just needs to apply COBS encoding directly on the output of
    the SCSU compressor. COBS then appears to be a better solution than what is
    suggested in sections 10.1 and 10.2 of TR6...

    Setting up COBS profiles is not necessary when implementing SCSU, so such
    extensions are really not needed. I would suggest that TR6 remove
    section 10, and instead move it into an annex showing how a transfer
    encoding syntax can be used to solve the problems it raises:

    The solutions presented in sections 10.1 and 10.2 are definitely not the
    best ones if one needs good compression of Unicode, because they have
    very bad worst cases that double the size of the output stream.

    Another option would be to add a section 10.3 referencing COBS as a better
    transfer encoding syntax, and saying that the existing 10.1 and 10.2
    solutions would be better modeled as simple transfer encoding syntaxes too,
    completely out of scope of SCSU itself. SCSU really does not need such
    extensions in its core, where they will produce interoperability problems
    now that it is a Unicode Technical Standard, to be implemented notably in
    XML or HTML parsers.


