Re: Nicest UTF

From: Philippe Verdy (verdy_p@wanadoo.fr)
Date: Thu Dec 02 2004 - 15:33:11 CST


    There's no *universal* best encoding.

    UTF-8, however, is certainly today the best encoding for portable
    communications and data storage (though it now competes with SCSU, a
    compression scheme in which, for most documents, each Unicode character
    is represented by about one byte on average; other schemes also exist
    that apply deflate compression to UTF-8).
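    The deflate-over-UTF-8 approach mentioned above is easy to sketch with
    the standard zlib module (the sample text below is invented purely for
    illustration; real compression ratios depend on the document):

```python
import zlib

# A deliberately repetitive sample document (illustrative only).
text = "Unicode is a universal character set. " * 50
raw = text.encode('utf-8')          # the portable UTF-8 byte stream
packed = zlib.compress(raw)         # deflate compression over those bytes

# On redundant text, deflate easily beats one byte per character on average.
assert len(packed) < len(text)
assert zlib.decompress(packed) == raw   # lossless round trip
```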

    The problem with UTF-16 and UTF-32 is byte ordering, where "byte" means
    the unit of portable networking and file storage, i.e. 8 bits in almost
    all current technologies. With UTF-16 and UTF-32 you need some way to
    determine how the bytes of each code unit are ordered when read from a
    byte-oriented stream. With UTF-8 you do not.
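    In practice that byte-order determination is usually done by sniffing a
    byte order mark at the start of the stream. A minimal sketch (the
    function name is mine, not any standard API; real decoders also fall
    back to heuristics or out-of-band metadata when no BOM is present):

```python
import codecs

def guess_utf16_order(data: bytes) -> str:
    """Guess the byte order of a UTF-16 stream from a leading BOM."""
    if data.startswith(codecs.BOM_UTF16_LE):   # bytes FF FE
        return 'little-endian'
    if data.startswith(codecs.BOM_UTF16_BE):   # bytes FE FF
        return 'big-endian'
    return 'unknown'  # no BOM: the order must come from elsewhere

le = codecs.BOM_UTF16_LE + 'hi'.encode('utf-16-le')
be = codecs.BOM_UTF16_BE + 'hi'.encode('utf-16-be')
assert guess_utf16_order(le) == 'little-endian'
assert guess_utf16_order(be) == 'big-endian'
# UTF-8 needs none of this: the same byte sequence decodes identically
# on every machine.
```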

    The problem with UTF-8 is that it is often inefficient or awkward to
    work with inside applications and libraries, which find it easier to
    access strings and count characters when code units have a fixed width.
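    The mismatch between character count and byte count is easy to see; the
    string below is chosen to span all four UTF-8 sequence lengths:

```python
# One character each from the 1-, 2-, 3-, and 4-byte UTF-8 ranges:
# 'a', e-acute, Hiragana 'a', and Gothic letter hwair.
s = "a\u00e9\u3042\U00010348"
encoded = s.encode('utf-8')

# Four code points, ten bytes: counting or indexing characters in UTF-8
# means scanning the variable-width byte sequence.
assert len(s) == 4
assert len(encoded) == 1 + 2 + 3 + 4
```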

    Although UTF-16 is not strictly fixed-width, it is quite easy to work
    with, and it is often more memory-efficient than UTF-32, since most
    characters fit in a single 16-bit code unit.

    UTF-32, however, is the easiest solution when an application really
    wants to handle every character, i.e. every Unicode code point, as a
    single code unit.
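    That one-unit-per-code-point property is exactly what UTF-16 gives up
    outside the Basic Multilingual Plane:

```python
clef = '\U0001D11E'  # MUSICAL SYMBOL G CLEF, a supplementary-plane character

utf32_units = len(clef.encode('utf-32-le')) // 4  # 4 bytes per code unit
utf16_units = len(clef.encode('utf-16-le')) // 2  # 2 bytes per code unit

assert utf32_units == 1  # UTF-32: always one code unit per code point
assert utf16_units == 2  # UTF-16: a surrogate pair outside the BMP
```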

    All these encodings (including the SCSU compression scheme, BOCU-1, a
    byte-ordered compression scheme for Unicode, and now also GB18030, the
    Chinese national standard which is a valid representation of the full
    Unicode repertoire) have their pros and cons.

    Choose among them: they are widely documented, and they interoperate
    well across the many libraries that handle them with consistent
    semantics.

    If these encodings do not satisfy your application's needs, you may even
    create your own (as Sun did when it modified UTF-8 so that any Unicode
    string can be stored within a null-terminated C string, and so that any
    sequence of 16-bit code units, even invalid ones containing unpaired
    surrogates, can be represented in 8-bit streams). If you do that, don't
    expect the encoding to be portable or recognized by other systems unless
    you document it with a complete specification and make it freely
    available for others to implement.
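    Sun's scheme (documented in Java as "Modified UTF-8") makes exactly two
    changes: U+0000 gets an overlong two-byte form so no NUL byte appears,
    and each 16-bit code unit, including a lone surrogate, is encoded on
    its own. A sketch (the function name and structure are mine):

```python
def modified_utf8(s: str) -> bytes:
    """Sketch of Java's Modified UTF-8 encoding of a 16-bit string."""
    out = bytearray()
    for ch in s:
        cp = ord(ch)
        if cp == 0:
            out += b'\xc0\x80'          # overlong NUL: C-string safe
        elif cp < 0x80:
            out.append(cp)              # plain ASCII, one byte
        elif cp < 0x800:
            out += bytes([0xC0 | cp >> 6, 0x80 | cp & 0x3F])
        elif cp < 0x10000:              # BMP, including unpaired surrogates
            out += bytes([0xE0 | cp >> 12,
                          0x80 | cp >> 6 & 0x3F,
                          0x80 | cp & 0x3F])
        else:                           # supplementary plane: encode each
            cp -= 0x10000               # UTF-16 surrogate separately
            for su in (0xD800 + (cp >> 10), 0xDC00 + (cp & 0x3FF)):
                out += bytes([0xE0 | su >> 12,
                              0x80 | su >> 6 & 0x3F,
                              0x80 | su & 0x3F])
    return bytes(out)

assert modified_utf8('\x00') == b'\xc0\x80'    # no embedded NUL byte
assert len(modified_utf8('\U0001D11E')) == 6   # six bytes, not four
```

    The output for supplementary characters differs from standard UTF-8,
    which is precisely why such a private variant needs a published
    specification before other systems can be expected to read it.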

    ----- Original Message -----
    From: "Arcane Jill" <arcanejill@ramonsky.com>
    To: "Unicode" <unicode@unicode.org>
    Sent: Thursday, December 02, 2004 2:19 PM
    Subject: RE: Nicest UTF

    > Oh for a chip with 21-bit wide registers!
    > :-)
    > Jill
    >
    > -----Original Message-----
    > From: unicode-bounce@unicode.org [mailto:unicode-bounce@unicode.org]On
    > Behalf Of Antoine Leca
    > Sent: 02 December 2004 12:12
    > To: Unicode Mailing List
    > Subject: Re: Nicest UTF
    >
    > There are other factors that might influence your choice.
    > For example, the relative cost of using 16-bit entities: on a Pentium
    > it is cheap, on more modern x86 processors the price is a bit higher,
    > and on some RISC chips it is prohibitive (that is, short may become 32
    > bits; obviously, in such a case, UTF-16 is not really a good choice).
    > On the other extreme, you have processors where bytes are 16 bits;
    > obviously again, UTF-8 is not optimum there. ;-)
    >
    >
    >



    This archive was generated by hypermail 2.1.5 : Thu Dec 02 2004 - 15:43:10 CST