Re: Brahmic list ? (was: Oriya: mba / mwa ?)

From: Doug Ewell (dewell@adelphia.net)
Date: Sun Nov 30 2003 - 15:15:59 EST

    Philippe Verdy <verdy underscore p at wanadoo dot fr> wrote:

    >> Please don't use UTF-8 to encode anything other than Unicode code
    >> points.
    >
    > As long as I use it internally for intermediate processing, I can do
    > what I want. For now it is just a convenient way to represent variable
    > size integers up to 31 bits (in fact I use it to represent 32 bit
    > signed integers, but the two highest bits are equal).

    As long as you are sure that this will not leak out into the outside
    world, you are free to use the UTF-8 mechanism internally to represent
    any type of 31-bit data you like, including this private replacement for
    allkeys.txt. (You do know about allkeys.txt, don't you? And the fact
    that UCA is heavily customizable?)

    It would seem to make sense primarily for retaining ASCII compatibility
    and representing smaller values in fewer bytes than larger values, so
    you would want to be sure these are your design goals too.
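
    For illustration, here is a minimal Python sketch of the mechanism
    under discussion. It assumes the original 1-to-6-byte UTF-8 bit layout
    (the one described in RFC 2279, which reaches 31 bits). The function
    name encode_utf8_style is hypothetical and is not taken from anyone's
    actual code.

```python
def encode_utf8_style(value: int) -> bytes:
    """Encode an integer below 2**31 with the original 1-to-6-byte
    UTF-8 bit layout. Illustrative sketch only."""
    if not 0 <= value < 2**31:
        raise ValueError("value must fit in 31 bits")
    if value < 0x80:
        # ASCII compatibility: small values are a single unchanged byte.
        return bytes([value])
    # (exclusive upper limit, total length in bytes, lead-byte marker)
    layouts = (
        (0x800, 2, 0xC0),
        (0x10000, 3, 0xE0),
        (0x200000, 4, 0xF0),
        (0x4000000, 5, 0xF8),
        (0x80000000, 6, 0xFC),
    )
    for limit, length, lead in layouts:
        if value < limit:
            break
    out = bytearray()
    # Emit continuation bytes (10xxxxxx) from the low end, 6 bits each.
    for _ in range(length - 1):
        out.append(0x80 | (value & 0x3F))
        value >>= 6
    # Whatever bits remain go into the lead byte.
    out.append(lead | value)
    return bytes(reversed(out))
```

    Values below 0x80 come out as one ASCII byte and larger values grow
    to at most six bytes. Note that anything above 0x10FFFF (for example
    0x110000, which comes out as F4 90 80 80) is rejected by a strict
    UTF-8 decoder, which is exactly why such data must stay internal.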

    But things like this do have a tendency to leak into the outside world,
    and if this ever happens with your collation keys, you will have
    unleashed something like CESU-8 that fails the "duck test": it walks and
    talks like UTF-8, but it's not.
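
    To make the "duck test" concrete, here is a small Python sketch that
    produces CESU-8 bytes and compares them with real UTF-8. The helper
    name cesu8_encode is hypothetical. It assumes the usual definition of
    CESU-8: supplementary code points are split into a UTF-16 surrogate
    pair, and each surrogate is then encoded as a 3-byte sequence.

```python
def cesu8_encode(text: str) -> bytes:
    """Encode text as CESU-8. Illustrative sketch only."""
    out = bytearray()
    for ch in text:
        cp = ord(ch)
        if cp >= 0x10000:
            # Split into a UTF-16 surrogate pair first.
            cp -= 0x10000
            high = 0xD800 + (cp >> 10)
            low = 0xDC00 + (cp & 0x3FF)
            for unit in (high, low):
                # "surrogatepass" lets the codec emit the 3-byte form
                # that strict UTF-8 forbids.
                out += chr(unit).encode("utf-8", "surrogatepass")
        else:
            # BMP characters are byte-for-byte identical to UTF-8.
            out += ch.encode("utf-8")
    return bytes(out)


bmp_only = "A\u00e9"
supplementary = "\U00010400"

same_for_bmp = cesu8_encode(bmp_only) == bmp_only.encode("utf-8")
cesu = cesu8_encode(supplementary)       # ED A0 81 ED B0 80
real = supplementary.encode("utf-8")     # F0 90 90 80
```

    For BMP-only text the two encodings are indistinguishable, so the
    difference shows up only once a supplementary character appears, at
    which point a strict UTF-8 decoder refuses the CESU-8 bytes.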

    > Of course if I still use it to represent something else than
    > codepoints in some published data or text, I will rename it and won't
    > keep the same charset label. But it's highly probable that this will
    > not be the most efficient representation (due to its byte-oriented
    > splitting), and a more compact or easier to process serialization
    > could require an alternate encoding scheme (or transfer syntax).

    This is a *much* better solution, whether it is the most efficient
    representation or not. CESU-8 is a classic and notorious example of a
    UTF-8-like encoding that could have been kept private and internal,
    where it belonged, but instead was "leaked" forcefully into the outside
    world, to the point where it was assigned an IANA charset label.

    UTF-8 can be auto-detected more or less reliably, and has achieved
    widespread use throughout the computing world. Please do not use it, or
    any extension of it, for representing anything other than Unicode code
    points.
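
    As a rough illustration of the auto-detection point, the following
    Python sketch treats a byte string as UTF-8 if a strict decoder
    accepts it. The function name looks_like_utf8 is hypothetical, and
    this is only one simple heuristic, not a complete charset detector
    (pure ASCII input, for instance, passes trivially).

```python
def looks_like_utf8(data: bytes) -> bool:
    """Return True if data is well-formed UTF-8. Heuristic sketch only."""
    try:
        data.decode("utf-8")  # strict error handling by default
    except UnicodeDecodeError:
        return False
    return True


utf8_text = "na\u00efve".encode("utf-8")       # well-formed UTF-8
latin1_text = "na\u00efve".encode("latin-1")   # 0xEF followed by ASCII
lookalike = bytes.fromhex("eda081edb080")      # CESU-8 style surrogates

results = [looks_like_utf8(b) for b in (utf8_text, latin1_text, lookalike)]
```

    Detection of this kind works because ill-formed sequences are rare in
    genuine UTF-8 data. Look-alike encodings that reuse the same byte
    patterns for other purposes weaken that guarantee.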

    -Doug Ewell
     Fullerton, California
     http://users.adelphia.net/~dewell/


