Re: Unicode forms for internal storage

From: Philippe Verdy (verdy_p@wanadoo.fr)
Date: Tue Jan 20 2004 - 14:54:29 EST

    From: "Elliotte Rusty Harold" <elharo@metalab.unc.edu>
    > Has anyone done any work on Unicode formats for this use-case? Does
    > anyone have any references or ideas to share?

    If you want something very simple to convert between UTF-8 and UTF-16, why
    not use them directly: require a leading BOM, encode the string in
    whichever of UTF-8 and UTF-16 is shorter, and omit the BOM only if the
    string contains only 7-bit ASCII? As UTF-16 must then start with a BOM,
    U+FEFF, i.e. with leading bytes {0xFE,0xFF} or {0xFF,0xFE}, there will
    never be any confusion between ASCII and UTF-16. Nor can ASCII be confused
    with UTF-8 carrying a BOM (leading bytes {0xEF,0xBB,0xBF}), or UTF-8 with
    UTF-16, since their BOMs are coded differently.
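
    As a rough illustration of why the three cases cannot be confused, here
    is a minimal Python sketch of a reader that looks only at the leading
    bytes. The function name decode_stored is hypothetical, and the fallback
    to ASCII assumes the writer followed the convention described above.

```python
import codecs

def decode_stored(data: bytes) -> str:
    """Decode a stored string by inspecting its leading bytes (hypothetical helper)."""
    if data.startswith(codecs.BOM_UTF8):        # EF BB BF
        return data[len(codecs.BOM_UTF8):].decode("utf-8")
    if data.startswith(codecs.BOM_UTF16_BE):    # FE FF
        return data[2:].decode("utf-16-be")
    if data.startswith(codecs.BOM_UTF16_LE):    # FF FE
        return data[2:].decode("utf-16-le")
    # No BOM: by the convention above the string must be pure 7-bit ASCII,
    # so its first byte is below 0x80 and cannot match any BOM.
    return data.decode("ascii")
```

    A non-ASCII byte sequence without a BOM makes the final decode raise
    UnicodeDecodeError, which is probably what you want for corrupt data.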

    So you get the advantages of all worlds, without necessarily implementing a
    complex compressor like BOCU-1 or SCSU: your final encoded strings will be
    one of:
    - 7-bit ASCII
    - 8-bit UTF-8 starting with a forced leading BOM
    - 16-bit UTF-16 starting with a forced leading BOM
    The cost is only the size of the BOM when coding something other than 7-bit
    ASCII: 3 bytes for UTF-8, 2 bytes for UTF-16. In all cases, the final
    encoding will be the shortest of the 3 alternatives above. Deciding which
    alternative to use can be done in a single pass in which you count the
    number of bytes needed for UTF-8 and for UTF-16 without the BOM, and note
    whether there are any characters outside the 7-bit ASCII range (this lets
    you allocate the final buffer and perform the actual encoding only once,
    after you have determined the size of each approach).
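
    The single counting pass could look something like the Python sketch
    below. The names measure and encode_shortest are hypothetical; choosing
    big-endian for UTF-16 and preferring UTF-8 on a tie are my assumptions,
    not part of the scheme. In Python the buffer allocation is hidden, but in
    a lower-level language the counts would give you the exact size to
    allocate.

```python
import codecs

def measure(text: str):
    """One pass: return (is_ascii, utf8_len, utf16_len) in bytes, BOMs excluded."""
    is_ascii, utf8_len, utf16_len = True, 0, 0
    for ch in text:
        cp = ord(ch)
        if cp < 0x80:            # ASCII: 1 byte in UTF-8, 1 code unit in UTF-16
            utf8_len += 1
            utf16_len += 2
            continue
        is_ascii = False
        if cp < 0x800:           # 2 bytes in UTF-8
            utf8_len += 2
            utf16_len += 2
        elif cp < 0x10000:       # rest of the BMP: 3 bytes in UTF-8
            utf8_len += 3
            utf16_len += 2
        else:                    # supplementary planes: surrogate pair in UTF-16
            utf8_len += 4
            utf16_len += 4
    return is_ascii, utf8_len, utf16_len

def encode_shortest(text: str) -> bytes:
    """Pick the shortest of ASCII, BOM + UTF-8, BOM + UTF-16 (BE assumed)."""
    is_ascii, utf8_len, utf16_len = measure(text)
    if is_ascii:
        return text.encode("ascii")                  # no BOM needed
    if 3 + utf8_len <= 2 + utf16_len:                # BOM costs 3 vs 2 bytes
        return codecs.BOM_UTF8 + text.encode("utf-8")
    return codecs.BOM_UTF16_BE + text.encode("utf-16-be")
```

    Mostly-Latin text should end up as UTF-8 and mostly-CJK text as UTF-16,
    since BMP characters from U+0800 upward take 3 bytes in UTF-8 but only 2
    in UTF-16.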

    Finally, nothing forbids using a single compressor after this step (for
    example a deflate compressor without the GZIP parameters header, as
    implemented in zlib and Java), if this helps. As your string will start
    either with a leading ASCII byte, with a 3-byte UTF-8 encoded BOM, or with
    a 2-byte UTF-16 encoded BOM, you could also argue that the leading BOM may
    be removed and replaced by a single non-ASCII byte. As there are 128 such
    bytes (128..255), the leading byte can carry one of these meanings:
    - 0..127: ASCII byte, which is itself part of a string coded with 7-bit
    ASCII only
    - 129: indicates an uncompressed UTF-8 string, coded after this byte without
    the BOM
    - 130: indicates an uncompressed UTF-16LE string, coded after this byte
    without the BOM
    - 131: indicates an uncompressed UTF-16BE string, coded after this byte
    without the BOM
    - 192: indicates a compressed string, coded after this byte as a deflated
    stream of ASCII bytes
    - 193: indicates a compressed string, coded after this byte as a deflated
    stream of UTF-8 bytes without the leading BOM
    - 194: indicates a compressed string, coded after this byte as a deflated
    stream of UTF-16LE bytes without the leading BOM
    - 195: indicates a compressed string, coded after this byte as a deflated
    stream of UTF-16BE bytes without the leading BOM
    You can create many variants of this for your internal storage...
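
    One possible variant of the tag-byte idea, sketched in Python with the
    standard zlib module. Everything here is illustrative: the names
    encode_tagged and decode_tagged are hypothetical, and I assume tag 131
    for uncompressed UTF-16BE, since 130 can only carry one meaning. A
    negative wbits value asks zlib for a raw deflate stream with no header
    or checksum.

```python
import zlib

# Tag bytes taken from the list above; 131 for UTF-16BE is an assumption.
RAW = {129: "utf-8", 130: "utf-16-le", 131: "utf-16-be"}
DEFLATED = {192: "ascii", 193: "utf-8", 194: "utf-16-le", 195: "utf-16-be"}

def _deflate(payload: bytes) -> bytes:
    # wbits=-15 produces a raw deflate stream (no zlib or gzip header).
    comp = zlib.compressobj(9, zlib.DEFLATED, -15)
    return comp.compress(payload) + comp.flush()

def encode_tagged(text: str) -> bytes:
    """Try every applicable form and keep the shortest one."""
    candidates = []
    if text.isascii():
        candidates.append(text.encode("ascii"))                  # untagged ASCII
        candidates.append(bytes([192]) + _deflate(text.encode("ascii")))
    else:
        for tag, enc in RAW.items():
            candidates.append(bytes([tag]) + text.encode(enc))
        for tag, enc in DEFLATED.items():
            if enc != "ascii":                                   # 192 is ASCII-only
                candidates.append(bytes([tag]) + _deflate(text.encode(enc)))
    return min(candidates, key=len)

def decode_tagged(data: bytes) -> str:
    """Dispatch on the first byte: below 128 means plain ASCII, else a tag."""
    if not data or data[0] < 128:
        return data.decode("ascii")
    tag, payload = data[0], data[1:]
    if tag in RAW:
        return payload.decode(RAW[tag])
    if tag in DEFLATED:
        return zlib.decompress(payload, -15).decode(DEFLATED[tag])
    raise ValueError(f"unknown tag byte {tag}")
```

    Trying every candidate costs several encodings per string, so for bulk
    storage you would probably combine this with the counting pass and only
    attempt deflate above some length threshold.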
