Re: MSDN Article, Second Draft

From: Philippe Verdy (verdy_p@wanadoo.fr)
Date: Wed Aug 18 2004 - 11:40:10 CDT

    Maybe a fourth level of abstraction is needed to complete what the MIME
    registry describes as "charsets": a TES (Transfer Encoding Syntax) is
    sometimes applied at the end, and some legacy CES specifications mix it
    with what should have been left in a separate TES.

    For example, the specification of SCSU (Simple Compression Scheme for
    Unicode) defines it as a way to convert a stream of code points directly to
    a stream of bytes, without going through the level of abstraction of
    intermediate "code units" (or in this case, code units are simply the
    encoded bytes).

    This makes SCSU a legal CEF (like UTF-32, UTF-16 and UTF-8) to convert a
    stream of encoded characters into a stream of (8-bit) code units, and a
    legal CES (like UTF-32BE, UTF-32LE, UTF-16BE, UTF-16LE, UTF-8 or CESU-8,
    or UTF-16, UTF-32, UTF-8 or CESU-8 with a leading BOM) to take into
    account the generated byte order.
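
    To make the CEF/CES distinction concrete, here is a minimal Python
    sketch using UTF-16 rather than SCSU. The helper names utf16_code_units
    and serialize are hypothetical and only illustrate the two steps; they
    come from no specification.

```python
import struct

def utf16_code_units(text):
    """CEF step: map each code point to one or two 16-bit code units."""
    units = []
    for ch in text:
        cp = ord(ch)
        if cp < 0x10000:
            units.append(cp)
        else:
            cp -= 0x10000
            units.append(0xD800 | (cp >> 10))    # high surrogate
            units.append(0xDC00 | (cp & 0x3FF))  # low surrogate
    return units

def serialize(units, big_endian=True):
    """CES step: write the 16-bit code units as bytes in a chosen byte order."""
    fmt = (">" if big_endian else "<") + "H" * len(units)
    return struct.pack(fmt, *units)

text = "A\u20ac\U0001D11E"            # U+0041, U+20AC, U+1D11E
units = utf16_code_units(text)        # same code units for BE and LE
be_bytes = serialize(units, big_endian=True)
le_bytes = serialize(units, big_endian=False)
```

    The code units are identical in both cases; only the CES step decides
    the byte order. For a byte-oriented form such as UTF-8 or SCSU the
    second step is trivial, which is why the two levels collapse there.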

    But the SCSU specification speaks about "optional extensions" which are
    probably badly named because they should be better described as TES
    (DLE-escaping for NUL and DLE, or run-length compression, or COBS encoding),
    exactly like other well-known TES (Base64, Quoted-Printable) widely used in
    MIME contexts.
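
    As an illustration of a TES sitting on top of the charset layers, the
    following Python sketch applies Base64 and Quoted-Printable to bytes
    that were already produced by a CES (UTF-8 here). The sample text and
    variable names are arbitrary choices; the SCSU extensions themselves
    are not implemented.

```python
import base64
import quopri

text = "na\u00efve caf\u00e9"

# Charset step (CCS + CEF + CES): characters to bytes.
ces_bytes = text.encode("utf-8")

# TES step: bytes to transport-safe bytes; the charset is not involved.
b64 = base64.b64encode(ces_bytes)
qp = quopri.encodestring(ces_bytes)

# Undoing the TES gives back exactly the CES bytes, whatever the charset.
from_b64 = base64.b64decode(b64)
from_qp = quopri.decodestring(qp)
```

    Either TES can be swapped for the other without touching the charset
    label, which is the separation argued for above.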

    I think that there still exist some other legacy charsets in the MIME
    registry that mix these levels of abstraction, where a clear separation
    between the CES and TES levels would have helped their interoperability.
    One cause of this discrepancy is that it has long been easier to create a
    new charset and have it registered in the long MIME registry than to
    define a clear TES separately (the TES registry in MIME is not extremely
    long, and support for multiple TES in applications has often been very
    weak and not easily extensible, developers preferring to develop first
    the support needed to correctly handle the many possible CES, identified
    only by their MIME "charset" identifier).

    The other related "problem" of TES is that many document structures
    (including XML) only offer a place to specify the "charset" (i.e. the result
    of a combination of a CCS, CEF and CES), but no place to specify the TES,
    which is left, apparently, to the transport protocol, ignoring the case of
    local storage, where identifying the TES reliably is nearly impossible.
    This means that local stores cannot easily benefit from the advantages of
    a TES specification (for example, when creating a reference to a text
    document, it's impossible to specify in the link that this document has
    been COBS-encoded or Base64-encoded, or even compressed in deflate or
    gzip form, unless the local document is stored in an envelope format,
    such as an RFC 2822 message with headers, and the hyperlink renderer
    supports decoding this envelope format transparently).
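
    The envelope approach can be sketched with Python's standard email
    package: the stored bytes carry both the charset and the TES in their
    own headers, so a reader needs no out-of-band information. The
    function names wrap and unwrap are hypothetical, and Base64 is an
    arbitrary choice of TES.

```python
from email import message_from_bytes, policy
from email.message import EmailMessage

def wrap(text):
    """Store text as a MIME message whose headers record charset and TES."""
    msg = EmailMessage()
    msg.set_content(text, charset="utf-8", cte="base64")
    return msg.as_bytes()

def unwrap(stored):
    """Read charset and TES from the envelope, then decode the body."""
    msg = message_from_bytes(stored, policy=policy.default)
    charset = msg.get_content_charset()
    tes = str(msg["Content-Transfer-Encoding"])
    return charset, tes, msg.get_content()

stored = wrap("Caf\u00e9 \u20ac\n")
charset, tes, text = unwrap(stored)
```

    The cost is that every consumer of the local store must understand the
    envelope, which is exactly the missing support described above.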

    For now, a hyperlink can specify the MIME type of the document with an
    attribute specifying the "charset", i.e. the triplet <CCS,CES,CEF>, but
    there is no reliable and documented attribute to specify its TES (unless
    the document is transported via email or HTTP, and the source does the
    job of transforming it to the desired TES on the fly, which is a
    CPU-intensive job for servers that could be avoided if documents could be
    stored or cached directly by the server in their TES-encoded form; this
    requires support in the server's storage to keep this out-of-band
    information).

    Solutions do exist, but they are not universal or interoperable across
    distinct software working with the same physical document store: some
    filesystems offer that support with out-of-band meta-data, some servers
    will use private conventions with multiple file extensions and private
    server configuration files...
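
    One example of such a file-extension convention is the one used by
    Python's standard mimetypes module, which reads a trailing ".gz" as an
    encoding applied on top of the media type given by the inner
    extension. The file names below are made up for illustration.

```python
import mimetypes

# The inner extension gives the media type; the outer one gives the encoding.
media_type, encoding = mimetypes.guess_type("report.txt.gz")

# Without the outer extension no encoding is reported.
plain_type, plain_encoding = mimetypes.guess_type("report.txt")
```

    This works only while every tool touching the store agrees on the same
    extension table, which is why it is a convention and not an
    interoperable solution.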

    If the document's TES encoding and decoding could be handled directly by
    the client, without depending on the underlying transport or storage
    technology, it would be easier.

    TES encoding is really out of scope for Unicode, but its support in
    various applications using encoded text documents should be enhanced.
    This includes support for it in the XML and HTML document syntax, notably
    within source hyperlinks.

    As a final note: multiple TES encoding stages may be chained in any
    transport or storage, and changed on the fly across nodes in a transport
    network, without affecting the charset used for the decoded document. But in
    many applications, including HTTP, only one TES can be specified (else it
    will break other features such as document content signature and
    certification). I know of no working implementation of any transport
    protocol that transparently allows specifying these multiple TES
    encodings (most often these steps are possible only in distinct layers of
    the transport architecture, where they can be made transparent to the
    applications handling encoded documents on the upper layers). This means
    that TES encoding/decoding affects the performance (and reliability...)
    of each relaying node in a transport network (such as proxies), a caveat
    avoided by including the TES within a MIME charset, so that no TES
    encoding (or more precisely just an identity, do-nothing, "8-bit" TES
    encoding) will be necessary in the relaying chain...
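
    Chaining can be sketched in Python as an ordered list of TES names
    that is applied in order and undone in reverse. The names ENCODERS,
    DECODERS, apply_chain and undo_chain are hypothetical, and gzip and
    Base64 merely stand in for whatever stages a real transport would use.

```python
import base64
import gzip

ENCODERS = {"gzip": gzip.compress, "base64": base64.b64encode}
DECODERS = {"gzip": gzip.decompress, "base64": base64.b64decode}

def apply_chain(data, chain):
    """Apply each TES stage in order to the charset-encoded bytes."""
    for name in chain:
        data = ENCODERS[name](data)
    return data

def undo_chain(data, chain):
    """Undo the stages in reverse order to recover the original bytes."""
    for name in reversed(chain):
        data = DECODERS[name](data)
    return data

text = "Unicode \u2603"
payload = text.encode("utf-16-le")    # the charset stays UTF-16LE throughout
chain = ["gzip", "base64"]
wire = apply_chain(payload, chain)
restored = undo_chain(wire, chain)
```

    A relaying node could replace the chain with a different one without
    ever decoding the charset, provided the chain itself is recorded
    somewhere the next node can read it.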

    ----- Original Message -----
    From: "John Tisdale" <jtisdale@ocean.org>
    To: <unicode@unicode.org>
    Sent: Wednesday, August 18, 2004 5:27 AM
    Subject: MSDN Article, Second Draft

    Thanks everyone for your helpful feedback on the first draft of the MSDN
    article. I couldn't fit in all of the suggestions as the Unicode portion is
    only a small piece of my article. The following is the second draft based on
    the corrections, additional information and resources provided.

    Also, I would like to get feedback on the most accurate/appropriate term/s
    for describing the CCS, CEF and CES (layers, levels, components, etc.)?

    I am under a tight deadline and need to collect any final feedback rather
    quickly before producing the final version.

    Special thanks to Asmus for investing a lot of his time to help.


