Re: Nicest UTF

From: Philippe Verdy (verdy_p@wanadoo.fr)
Date: Thu Dec 02 2004 - 15:33:11 CST


    There's no *universal* best encoding.

    UTF-8, however, is certainly today the best encoding for portable
    communications and data storage (though it now competes with SCSU, a
    compression scheme in which, for most documents, each Unicode character
    is represented by about one byte on average; other schemes also exist
    that apply deflate compression to UTF-8).
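    The deflate-over-UTF-8 approach mentioned above is easy to sketch with
    the standard zlib module (the sample text below is invented purely for
    illustration; real compression ratios depend on the document):

```python
import zlib

# A deliberately repetitive sample document (illustrative only).
text = "Unicode is a universal character set. " * 50
raw = text.encode('utf-8')          # the portable UTF-8 byte stream
packed = zlib.compress(raw)         # deflate compression over those bytes

# On redundant text, deflate easily beats one byte per character on average.
assert len(packed) < len(text)
assert zlib.decompress(packed) == raw   # lossless round trip
```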

    The problem with UTF-16 and UTF-32 is byte ordering, where "byte" means
    the unit of portable networking and file storage, i.e. 8 bits in almost
    all current technologies. With UTF-16 and UTF-32 you need some way to
    determine how the bytes of each code unit are ordered when read from a
    byte-oriented stream. With UTF-8 you do not.
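    In practice that byte-order determination is usually done by sniffing a
    byte order mark at the start of the stream. A minimal sketch (the
    function name is mine, not any standard API; real decoders also fall
    back to heuristics or out-of-band metadata when no BOM is present):

```python
import codecs

def guess_utf16_order(data: bytes) -> str:
    """Guess the byte order of a UTF-16 stream from a leading BOM."""
    if data.startswith(codecs.BOM_UTF16_LE):   # bytes FF FE
        return 'little-endian'
    if data.startswith(codecs.BOM_UTF16_BE):   # bytes FE FF
        return 'big-endian'
    return 'unknown'  # no BOM: the order must come from elsewhere

le = codecs.BOM_UTF16_LE + 'hi'.encode('utf-16-le')
be = codecs.BOM_UTF16_BE + 'hi'.encode('utf-16-be')
assert guess_utf16_order(le) == 'little-endian'
assert guess_utf16_order(be) == 'big-endian'
# UTF-8 needs none of this: the same byte sequence decodes identically
# on every machine.
```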

    The problem with UTF-8 is that it is often inefficient or awkward to
    work with inside applications and libraries, which find it easier to
    access strings and count characters when code units have a fixed width.
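    The mismatch between character count and byte count is easy to see; the
    string below is chosen to span all four UTF-8 sequence lengths:

```python
# One character each from the 1-, 2-, 3-, and 4-byte UTF-8 ranges:
# 'a', e-acute, Hiragana 'a', and Gothic letter hwair.
s = "a\u00e9\u3042\U00010348"
encoded = s.encode('utf-8')

# Four code points, ten bytes: counting or indexing characters in UTF-8
# means scanning the variable-width byte sequence.
assert len(s) == 4
assert len(encoded) == 1 + 2 + 3 + 4
```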

    Although UTF-16 is not strictly fixed-width, it is quite easy to work
    with, and it is often more memory-efficient than UTF-32, since most
    characters fit in a single 16-bit code unit.

    UTF-32, however, is the easiest solution when an application really
    wants to handle every character, i.e. every Unicode code point, as a
    single code unit.
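    That one-unit-per-code-point property is exactly what UTF-16 gives up
    outside the Basic Multilingual Plane:

```python
clef = '\U0001D11E'  # MUSICAL SYMBOL G CLEF, a supplementary-plane character

utf32_units = len(clef.encode('utf-32-le')) // 4  # 4 bytes per code unit
utf16_units = len(clef.encode('utf-16-le')) // 2  # 2 bytes per code unit

assert utf32_units == 1  # UTF-32: always one code unit per code point
assert utf16_units == 2  # UTF-16: a surrogate pair outside the BMP
```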

    All these encodings (including the SCSU compression scheme, BOCU-1, a
    byte-ordered compression scheme for Unicode, and now also GB18030, the
    Chinese national standard which is a valid representation of the full
    Unicode repertoire) have their pros and cons.

    Choose among them: they are widely documented, and they interoperate
    well across the many libraries that handle them with consistent
    semantics.

    If these encodings do not satisfy your application's needs, you may even
    create your own (as Sun did when it modified UTF-8 so that any Unicode
    string can be stored within a null-terminated C string, and so that any
    sequence of 16-bit code units, even invalid ones containing unpaired
    surrogates, can be represented in 8-bit streams). If you do that, don't
    expect the encoding to be portable or recognized by other systems unless
    you document it with a complete specification and make it freely
    available for others to implement.
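    Sun's scheme (documented in Java as "Modified UTF-8") makes exactly two
    changes: U+0000 gets an overlong two-byte form so no NUL byte appears,
    and each 16-bit code unit, including a lone surrogate, is encoded on
    its own. A sketch (the function name and structure are mine):

```python
def modified_utf8(s: str) -> bytes:
    """Sketch of Java's Modified UTF-8 encoding of a 16-bit string."""
    out = bytearray()
    for ch in s:
        cp = ord(ch)
        if cp == 0:
            out += b'\xc0\x80'          # overlong NUL: C-string safe
        elif cp < 0x80:
            out.append(cp)              # plain ASCII, one byte
        elif cp < 0x800:
            out += bytes([0xC0 | cp >> 6, 0x80 | cp & 0x3F])
        elif cp < 0x10000:              # BMP, including unpaired surrogates
            out += bytes([0xE0 | cp >> 12,
                          0x80 | cp >> 6 & 0x3F,
                          0x80 | cp & 0x3F])
        else:                           # supplementary plane: encode each
            cp -= 0x10000               # UTF-16 surrogate separately
            for su in (0xD800 + (cp >> 10), 0xDC00 + (cp & 0x3FF)):
                out += bytes([0xE0 | su >> 12,
                              0x80 | su >> 6 & 0x3F,
                              0x80 | su & 0x3F])
    return bytes(out)

assert modified_utf8('\x00') == b'\xc0\x80'    # no embedded NUL byte
assert len(modified_utf8('\U0001D11E')) == 6   # six bytes, not four
```

    The output for supplementary characters differs from standard UTF-8,
    which is precisely why such a private variant needs a published
    specification before other systems can be expected to read it.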

    ----- Original Message -----
    From: "Arcane Jill" <arcanejill@ramonsky.com>
    To: "Unicode" <unicode@unicode.org>
    Sent: Thursday, December 02, 2004 2:19 PM
    Subject: RE: Nicest UTF

    > Oh for a chip with 21-bit wide registers!
    > :-)
    > Jill
    >
    > -----Original Message-----
    > From: unicode-bounce@unicode.org [mailto:unicode-bounce@unicode.org]On
    > Behalf Of Antoine Leca
    > Sent: 02 December 2004 12:12
    > To: Unicode Mailing List
    > Subject: Re: Nicest UTF
    >
    > There are other factors that might influence your choice.
    > For example, the relative cost of using 16-bit entities: on a Pentium
    > it is cheap, on more modern x86 processors the price is a bit higher,
    > and on some RISC chips it is prohibitive (that is, short may become 32
    > bits; obviously, in such a case, UTF-16 is not really a good choice).
    > On the other extreme, you have processors where bytes are 16 bits;
    > obviously again, UTF-8 is not optimum there. ;-)
    >
    >
    >



    This archive was generated by hypermail 2.1.5 : Thu Dec 02 2004 - 15:43:10 CST