Re: Unicode forms for internal storage - BOCU-1 speed

From: Mark Davis (mark.davis@jtcsv.com)
Date: Thu Jan 22 2004 - 14:45:53 EST


    I think Markus was referring to UTF-8 as a "compression" format, which
    was the context of his message. And you would have to add that it is
    really good at ASCII-only text...

    Mark
    __________________________________
    http://www.macchiato.com
    ► शिष्यादिच्छेत्पराजयम् ◄

    ----- Original Message -----
    From: <jcowan@reutershealth.com>
    To: "Markus Scherer" <markus.scherer@jtcsv.com>
    Cc: <unicode@unicode.org>
    Sent: Thu, 2004 Jan 22 10:50
    Subject: Re: Unicode forms for internal storage - BOCU-1 speed

    > Markus Scherer scripsit:
    >
    > > UTF-8 is useful because it's simple, and supported just about everywhere -
    > > but it's otherwise hardly optimal for anything.
    >
    > You entirely omit its principal advantage, sine qua non: it's maximally
    > ASCII-compatible, using bytes 0x00 to 0x7F to represent ASCII characters and
    > nothing else.
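The ASCII-compatibility property quoted above can be checked mechanically. The following is a minimal Python sketch, not anything from the original thread; the function name `ascii_bytes_only_for_ascii` and the sample string are illustrative assumptions. It counts the bytes below 0x80 in the UTF-8 form of a string and compares that with the number of ASCII characters in the string, which should match if such bytes never appear as part of a multi-byte sequence.

```python
def ascii_bytes_only_for_ascii(text: str) -> bool:
    """Return True if every byte < 0x80 in the UTF-8 form of `text`
    corresponds to an ASCII character of `text` (hypothetical helper)."""
    encoded = text.encode("utf-8")
    # Bytes in the range 0x00..0x7F found in the encoded form.
    ascii_byte_count = sum(1 for b in encoded if b < 0x80)
    # Characters in the range U+0000..U+007F found in the original string.
    ascii_char_count = sum(1 for ch in text if ord(ch) < 0x80)
    return ascii_byte_count == ascii_char_count


# Illustrative sample mixing ASCII, Latin-1 and Devanagari characters.
sample = "path/to/caf\u00e9 \u0936\u093f\u0937\u094d\u092f"
print(ascii_bytes_only_for_ascii(sample))
```

In practice this is why byte-oriented code that scans for delimiters such as `/` or NUL tends to keep working on UTF-8 data: those byte values cannot occur inside the encoding of a non-ASCII character.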
    >
    > Mark Crispin's UTF-9 (not to be confused with Jerome Abela's) is also
    > excellent, although most of us don't have 36-bit systems, for which it
    > makes sense. A precis:
    >
    > Code points (base 2)     UTF-9 code units (base 2)
    > 0000000000000abcdefgh    0abcdefgh
    > 00000abcdefghijklmnop    1abcdefgh 0ijklmnop
    > abcdefghijklmnopqrstu    1000abcde 1fghijklm 0nopqrstu
    >
    > This is almost as good as Latin-1 for its repertoire, only minutely worse
    > than UTF-16 for the rest of the BMP, and beats all other encodings for the
    > other planes.
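To make the precis concrete, here is a rough Python sketch of an encoder that follows the bit layout in the table: the code point is split into octets, leading zero octets are dropped, and each octet is stored in a 9-bit unit whose top bit is 1 on every unit except the last. The names `utf9_encode` and `utf9_bits` are assumptions chosen for illustration, and the sketch is reconstructed only from the table in this message, not from any reference implementation.

```python
def utf9_encode(code_point: int) -> list:
    """Return the 9-bit code units for one code point, following the
    three-row table in the message (hypothetical helper)."""
    if not 0 <= code_point <= 0x1FFFFF:
        raise ValueError("code point does not fit in 21 bits")
    # Collect octets from least to most significant, stopping once the
    # remaining high bits are all zero (this drops leading zero octets).
    octets = []
    value = code_point
    while True:
        octets.append(value & 0xFF)
        value >>= 8
        if value == 0:
            break
    octets.reverse()
    # Set the continuation bit (0x100) on every unit except the last.
    return [octet | 0x100 for octet in octets[:-1]] + [octets[-1]]


def utf9_bits(code_point: int) -> int:
    """Number of bits the sketch above uses for one code point."""
    return 9 * len(utf9_encode(code_point))


for cp in (0x41, 0xE9, 0x20AC, 0x1F600):
    units = " ".join(format(u, "09b") for u in utf9_encode(cp))
    print(f"U+{cp:04X}: {units} ({utf9_bits(cp)} bits)")
```

Under this reading, a Latin-1 character takes 9 bits against 8 in Latin-1, any other BMP character takes 18 bits against 16 in UTF-16, and a supplementary-plane character takes 27 bits against 32 in UTF-8, UTF-16 or UTF-32, which is the comparison the paragraph above is making.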
    >
    > --
    > John Cowan <jcowan@reutershealth.com>
    > http://www.ccil.org/~cowan http://www.reutershealth.com
    > Charles li reis, nostre emperesdre magnes,
    > Set anz totz pleinz ad ested in Espagnes.
    >
    >


