Re: Unicode forms for internal storage

From: Doug Ewell (
Date: Wed Jan 21 2004 - 02:12:18 EST

    Elliotte Rusty Harold <elharo at metalab dot unc dot edu> wrote:

    > In developing such a format I have a couple of advantages:
    > 1. Most C0 controls are forbidden, and will not appear in the data.
    > That's already verified. If someone tries to pass in a C0 control
    > other than tab, linefeed, or carriage return to setValue, an
    > exception is thrown and the data is not stored. Potentially one or
    > more of these characters could be used as markers in the stream.

    Oooh. That could be a problem with SCSU, since the SCU tag (needed to
    switch from single-byte mode to so-called "Unicode mode") is 0x0F, and
    since characters in the range U+xx00 through U+xx1F (for any non-zero
    value of xx) stored in "Unicode mode" would have their LSB written out
    directly, conflicting with C0 controls.
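
To make the collision concrete, here is a rough Python sketch. It is not a real SCSU encoder: it only emits the 0x0F mode-switch tag followed by big-endian UTF-16 code units, and it refuses anything that would need quoting or surrogates. The names scsu_unicode_mode and c0_bytes are made up for this illustration.

```python
SCU = 0x0F  # tag byte that switches SCSU into "Unicode mode"


def scsu_unicode_mode(text):
    """Emit SCU followed by UTF-16BE code units (illustrative sketch only)."""
    out = bytearray([SCU])
    for ch in text:
        cp = ord(ch)
        # Assumption: keep the sketch simple by rejecting supplementary
        # characters, surrogates, and code points whose high byte falls in
        # the Unicode-mode tag range 0xE0..0xF2 (those would need quoting).
        if cp > 0xFFFF or 0xD800 <= cp <= 0xDFFF or 0xE000 <= cp <= 0xF2FF:
            raise ValueError(f"U+{cp:04X} is outside the scope of this sketch")
        out += cp.to_bytes(2, "big")
    return bytes(out)


def c0_bytes(data):
    """Return (offset, value) for every byte in the C0 range 0x00..0x1F."""
    return [(i, b) for i, b in enumerate(data) if b < 0x20]


# U+4E00 and U+4E09 are ordinary CJK ideographs with low bytes 0x00 and 0x09.
data = scsu_unicode_mode("\u4e00\u4e09")
print(data.hex(" "))   # 0f 4e 00 4e 09
print(c0_bytes(data))  # [(0, 15), (2, 0), (4, 9)]
```

The tag itself, a NUL, and a tab all show up in the output, so none of those byte values could double as an out-of-band marker in such a stream.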

    BOCU-1 might solve this problem, but multiplying and dividing by 243
    doesn't sound faster than UTF-8 bit-shifting. (I'm still amazed by the
    claim in UTN #6 that converting Hindi text between UTF-16 and BOCU-1
    took only 45% as long as converting it between UTF-16 and UTF-8.)
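
For a feel of the two kinds of arithmetic being compared, here is a small Python sketch. It is not a BOCU-1 encoder: the real algorithm also tracks a previous-character state, handles negative differences, and maps each digit onto an allowed byte value. The function names are hypothetical, and the sketch says nothing about actual speed.

```python
def utf8_three_byte(cp):
    """Encode U+0800..U+FFFF (non-surrogate) as UTF-8 using shifts and masks."""
    if not 0x0800 <= cp <= 0xFFFF or 0xD800 <= cp <= 0xDFFF:
        raise ValueError("sketch covers three-byte UTF-8 sequences only")
    return bytes([
        0xE0 | (cp >> 12),          # lead byte carries the top 4 bits
        0x80 | ((cp >> 6) & 0x3F),  # middle 6 bits
        0x80 | (cp & 0x3F),         # low 6 bits
    ])


def base243_digits(delta, count):
    """Split a non-negative delta into `count` base-243 digits, high digit first."""
    if delta < 0:
        raise ValueError("sketch handles non-negative deltas only")
    digits = []
    for _ in range(count):
        delta, digit = divmod(delta, 243)  # one divide/modulo per trail digit
        digits.append(digit)
    return digits[::-1]


# Sanity check against Python's own codec for DEVANAGARI LETTER KA.
assert utf8_three_byte(0x0915) == "\u0915".encode("utf-8")
print(base243_digits(1000, 2))  # [4, 28], since 4 * 243 + 28 == 1000
```

One possible reading of the UTN #6 figure is that most characters in single-script text sit close to their predecessor, so the divide path is rarely taken; that is a guess on my part, not something the sketch demonstrates.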

    -Doug Ewell
     Fullerton, California
