Re: UTF-9

From: Philippe Verdy
Date: Fri Oct 31 2003 - 21:37:44 CST

From: "John Cowan"
> Mark Crispin scripsit:
> > I thought about UTF-18, but I couldn't think of a good way to represent
> > Unicode in 18 bits without surrogates. On the other hand, the idea to
> > map 0/1/2/14 (BMP/SMP/SIP/SSP) in a UTF-18 is interesting.
> I agree, and think it makes sense.

My best choice would be to cover planes 0/1/2/3 with a single code unit in
UTF-18, expecting that a huge number of characters will soon have to be
encoded in a second supplementary ideographic plane.

For your information, surrogates only exist in the BMP, not in the other
planes. To cover the whole Unicode set, UTF-18 would have to use them, with
the same decomposition (of characters that don't fit in a single code unit)
as in UTF-16, as that simplifies things. This means that in UTF-18, the
surrogate pairs normally needed for characters in planes 1 to 3 would
become invalid, since those planes already fit in a single code unit.

If someone ever resurrects 9-bit bytes in some new 72-bit RISC architecture
with very long parallel instruction formats of 144 bits (18 octets or 16
nonets), such an idea would of course make a lot of sense, as would
UTF-9... On such systems, the extra bit in each byte could be used on I/O
as a parity bit on bytes, or as a CRC code on words, and filesystems could
be updated to include a storage attribute for disks, specifying whether
these bits are used to remap and verify octet-based data, or serve as plain
storage to save space. But the main issue could come from devices like IDE
and SCSI disks.
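As a sketch of what a nonet-based UTF-9 could look like, assume each 9-bit unit carries 8 data bits, with the ninth (top) bit set on every nonet except the last of a character as a continuation flag. This keeps ASCII one nonet long and degrades gracefully to octet-minded software; the scheme and function names here are assumptions for illustration, not a specification.

```python
# Hypothetical UTF-9 sketch: 9-bit "nonets" carried in Python ints.
# Bit 8 (0x100) set = continuation nonet; bit clear = final nonet of
# the character. Assumed layout for illustration only.

def utf9_encode(cp):
    """Encode one code point as a list of 9-bit values."""
    octets = []
    while True:                        # split into base-256 digits
        octets.append(cp & 0xFF)
        cp >>= 8
        if cp == 0:
            break
    octets.reverse()                   # most significant digit first
    # Set the continuation bit on all but the last nonet.
    return [o | 0x100 for o in octets[:-1]] + [octets[-1]]

def utf9_decode(stream):
    """Decode a sequence of 9-bit values back to code points."""
    out, cp = [], 0
    for n in stream:
        cp = (cp << 8) | (n & 0xFF)    # accumulate 8 data bits
        if not (n & 0x100):            # top bit clear: character ends
            out.append(cp)
            cp = 0
    return out
```

Under this layout U+0000..U+00FF cost one nonet, the rest of the BMP two, and supplementary planes three, so the per-character overhead on a nonet machine stays modest.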

If computer speeds continue to grow, the self-correcting CRC capability of
very-high-speed buses (including within the processor itself) may become a
requirement for all fast I/O operations, to help prevent the bad effects of
external electromagnetic pollution and bursts, while also maintaining good
interoperability with legacy octet-based software...

This archive was generated by hypermail 2.1.5 : Thu Jan 18 2007 - 15:54:25 CST