Re: Brahmic list ? (was: Oriya: mba / mwa ?)

From: Doug Ewell (dewell@adelphia.net)
Date: Sun Nov 30 2003 - 18:43:12 EST

    Philippe Verdy <verdy underscore p at wanadoo dot fr> wrote:

    > As I have not determined the correct size of these bitfields, I need
    > some intermediate solution to pack them a little, and the UTF-8 TES
    > (not the UTF-8 CES used by Unicode)venient for now, until I change it
    > to a better encoding, which may or may not leak out (I am not sure
    > that I need to make the encoding accessible from an interface, except
    > for debugging).

    I hope I understand the "venient" passage correctly.

    I'm pretty sure you mean "... the UTF-8 CES (not the UTF-8 CEF used by
    Unicode)..." A CEF maps code points to code units, and you don't mean
    that because you're not mapping Unicode code points.

    A CES, on the other hand, maps code units to bytes, and that *is* what
    you are doing with the code units in your internal mechanism: mapping
    them to bytes using the original 31-bit definition of UTF-8.
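    To make the "original 31-bit definition" concrete, here is a minimal
    sketch of that byte layout, which allows sequences of up to six bytes.
    The function name encode_utf8_31bit is made up for illustration, and
    the input is treated as an arbitrary integer (such as a packed
    bitfield), not necessarily a Unicode code point. It is not taken from
    Philippe's code.

```python
def encode_utf8_31bit(value: int) -> bytes:
    """Encode one integer in 0..0x7FFFFFFF using the original UTF-8
    bit layout (one to six bytes). Hypothetical helper for illustration."""
    if not 0 <= value <= 0x7FFFFFFF:
        raise ValueError("value must fit in 31 bits")
    if value < 0x80:
        # Single byte: 0xxxxxxx
        return bytes([value])
    # (exclusive upper limit, sequence length, lead-byte prefix)
    for limit, length, prefix in (
        (0x800, 2, 0xC0),
        (0x10000, 3, 0xE0),
        (0x200000, 4, 0xF0),
        (0x4000000, 5, 0xF8),
        (0x80000000, 6, 0xFC),
    ):
        if value < limit:
            break
    out = bytearray(length)
    # Fill trailing bytes (10xxxxxx) from the low-order six bits upward.
    for i in range(length - 1, 0, -1):
        out[i] = 0x80 | (value & 0x3F)
        value >>= 6
    # Whatever bits remain go into the lead byte.
    out[0] = prefix | value
    return bytes(out)
```

    Values up to 0x10FFFF come out the same as in today's UTF-8 (setting
    aside the surrogate range, which modern UTF-8 forbids); the five- and
    six-byte forms are what the later restriction to 0..10FFFF removed.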

    A TES is a very specific thing. Apparently this term is reserved for
    mappings that explicitly solve a particular problem, such as MIME
    compatibility or compression. So quoted-printable is a good example of
    a TES, because it makes an arbitrary text stream -- already encoded in
    UTF-8, Windows code page 1252, or whatever -- transferable through
    mechanisms that support RFC 822, avoiding all of the bytes that mean
    something special. Likewise, Base64 is applied directly to an arbitrary
    byte stream, which means the data had already been encoded in a CES
    before the additional Base64 layer was applied.
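    A small sketch of that layering, using Python's standard quopri and
    base64 modules. The sample string and variable names are only
    illustrative. The point is that both transforms take bytes as input,
    whatever encoding produced them, and give back bytes that are safe for
    a 7-bit transport.

```python
import base64
import quopri

text = "\u00bfQu\u00e9?"  # illustrative sample text

# Step 1: the text is turned into bytes by some character encoding.
utf8_bytes = text.encode("utf-8")
cp1252_bytes = text.encode("cp1252")

# Step 2: the transfer encoding is layered over those bytes. It neither
# knows nor cares which character encoding produced them.
qp_from_utf8 = quopri.encodestring(utf8_bytes)
qp_from_cp1252 = quopri.encodestring(cp1252_bytes)
b64_from_utf8 = base64.b64encode(utf8_bytes)

# Undoing the transfer layer yields the original bytes, not text.
restored = quopri.decodestring(qp_from_utf8)
```

    The same text gives different quoted-printable output depending on
    which character encoding ran first, which is one way to see that QP
    operates on bytes rather than on characters.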

    I've always had trouble with the assertion that SCSU (for example) is a
    TES rather than a CES. Certainly it solves a particular problem
    (compression) and avoids, to an extent, gratuitous use of bytes like 0D
    and 0A. However, it is applied to a sequence of *Unicode code points*,
    not code units, and certainly not bytes the way QP is. You don't take
    the UTF-8-encoded stream <C2 BF 51 75 C3 A9 3F> and encode *those seven
    bytes* in SCSU; rather, you encode the stream of five Unicode code
    points <00BF 0051 0075 00E9 003F>.
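    The two views of that sample can be checked with a few lines of
    Python. This sketch only shows the two candidate inputs (seven UTF-8
    bytes versus five code points); it does not attempt SCSU itself, and
    the variable names are illustrative.

```python
text = "\u00bfQu\u00e9?"

# What a byte-oriented layer such as quoted-printable would be handed.
utf8_bytes = text.encode("utf-8")
byte_view = " ".join(f"{b:02X}" for b in utf8_bytes)

# What a compression scheme defined over Unicode text would be handed.
code_points = [ord(ch) for ch in text]
code_point_view = " ".join(f"{cp:04X}" for cp in code_points)

print(byte_view)        # C2 BF 51 75 C3 A9 3F
print(code_point_view)  # 00BF 0051 0075 00E9 003F
```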

    That said, the definitions in UTR #17 were surprisingly difficult for me
    to wrap my brain around in general, so I might be off-base on some of
    this.

    -Doug Ewell
     Fullerton, California
     http://users.adelphia.net/~dewell/
