Re: UTF-9

From: Philippe Verdy (verdy_p@wanadoo.fr)
Date: Thu Oct 30 2003 - 17:41:32 CST


From: "John Cowan" <jcowan@reutershealth.com>

> http://panda.com/tops-20/utf9.txt
>
> Res ipsa loquitur.

Are there still platforms today where storage bytes are not octets but
nonets, i.e. 9-bit based platforms? If so, this proposal makes sense, but
only as a local optimization for these platforms. Problems will come if you
want to interchange this data with the large majority of hosts, which cannot
handle a 9th bit in their bytes.

This means that the interchange would require sending 2 octets to represent
each 9-bit byte without losing data, or using a complex bit pattern to pack
sequences of eight 9-bit bytes into sequences of nine 8-bit bytes, with a
way to interpret the last sequence. (Such converters needed for
interoperability do exist: look for example at the MIME Base64 algorithm,
which is used to pack sequences of 8-bit bytes into serialized octets each
with 6 significant bits.)
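
To make the second option concrete, here is a minimal Python sketch of such a
packer. The function names pack_nonets and unpack_nonets, the MSB-first bit
order, and the zero padding of the last octet are all assumptions made for
illustration; they are not taken from the UTF-9 proposal or from any existing
protocol.

```python
def pack_nonets(nonets):
    """Pack 9-bit values into octets, most significant bit first.

    Eight nonets (72 bits) fill exactly nine octets; a shorter final
    group is padded with zero bits (an assumption of this sketch).
    """
    acc = 0      # bit accumulator
    nbits = 0    # number of pending bits held in acc
    out = bytearray()
    for n in nonets:
        if not 0 <= n < 512:
            raise ValueError("not a 9-bit value: %r" % (n,))
        acc = (acc << 9) | n
        nbits += 9
        while nbits >= 8:
            nbits -= 8
            out.append((acc >> nbits) & 0xFF)
        acc &= (1 << nbits) - 1   # keep only the pending bits
    if nbits:
        out.append((acc << (8 - nbits)) & 0xFF)  # zero-pad the last octet
    return bytes(out)


def unpack_nonets(data):
    """Inverse of pack_nonets; trailing pad bits are ignored."""
    acc = 0
    nbits = 0
    out = []
    for octet in data:
        acc = (acc << 8) | octet
        nbits += 8
        while nbits >= 9:
            nbits -= 9
            out.append((acc >> nbits) & 0x1FF)
        acc &= (1 << nbits) - 1
    return out
```

With this layout the padding is always shorter than 9 bits, so the number of
nonets can be recovered from the octet count alone. The cost is 9 octets per
8 nonets, against 16 octets for the simpler two-octets-per-byte option.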

UTF-9 seems interesting in this case, but is it worth it, given that it is
not directly interchangeable with the most common networking technologies?
Can't you accept losing 1 bit per storage byte?

What will happen then to a plain-text file coded with UTF-9 and sent
through FTP? Do you mean that FTP should use a Base256 converter for 9-bit
platforms, similar to Base64 for 8-bit platforms, to avoid losing the most
significant bit of each transferred byte? How is the recipient of the file
supposed to interpret the converted data? Is it still plain text?

So if the format is not interchangeable, this UTF-9 transform looks like a
local-only transformation, and locally, each host can use its own
representation. And why not then a UTF-18 encoding scheme that would avoid
using UTF-16 surrogates for all characters that fit in the first 4 planes?

For me, a UTF-18 encoding would make better sense if local optimization in
memory is the issue, as it would represent almost all existing Unicode
characters in planes 0 (BMP), 1 (SMP), 2 (SIP) and 3 (still not used, but
you may instead map the SSP, plane 14, there for tags and variation
selectors, or keep it for later use as SIP2) in one 18-bit code unit... But
you'll still need a converter to transform it to UTF-8 or a UTF-16 encoding
scheme to perform any I/O.
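
As a rough illustration of that idea, the following Python sketch maps a code
point to a single 18-bit code unit. It assumes the variant in which the slot
of plane 3 is reused for plane 14; the function names utf18_encode and
utf18_decode are hypothetical, and the sketch does not check for surrogate
code points.

```python
def utf18_encode(cp):
    """Map a Unicode code point to one 18-bit code unit.

    Planes 0, 1 and 2 are stored as-is; plane 14 is remapped into the
    slot of plane 3 (an assumption of this sketch). Anything else has
    no 18-bit representation here.
    """
    plane = cp >> 16
    if plane in (0, 1, 2):
        return cp
    if plane == 14:
        return 0x30000 | (cp & 0xFFFF)
    raise ValueError("U+%04X does not fit in one 18-bit unit" % cp)


def utf18_decode(unit):
    """Inverse mapping: 18-bit code unit back to a code point."""
    if not 0 <= unit < 0x40000:
        raise ValueError("not an 18-bit value: %r" % (unit,))
    if unit >> 16 == 3:
        return 0xE0000 | (unit & 0xFFFF)   # slot 3 holds plane 14
    return unit
```

For example, utf18_encode(0xE0001) gives 0x30001, and utf18_decode(0x30001)
gives back 0xE0001. Code points in the remaining planes (such as the private
use planes 15 and 16) raise ValueError in this sketch, and any I/O would
still need a conversion to UTF-8 or UTF-16.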


