Re: UTF-9

From: John Cowan
Date: Fri Oct 31 2003 - 21:48:26 CST

Mark Crispin scripsit:

> I keep on getting conflicting input on that point. In particular, I keep
> on hearing that ISO 10646 does have allocations in that space and people
> are using them.

Definitely not true. ISO 10646 tracks Unicode codepoint for codepoint,
and the *only* planes in use are 0, 1, 2, and 14, plus 15 and 16 which are
allocated for private use. The latest edition of 10646 doesn't even
have planes above 16 any more.

> I don't dispute that for all practical purposes, planes above 16 do not
> exist. In fact, for many (most?) practical purposes, planes above the BMP
> do not exist... :-)

Planes 1 and 2 are getting to be more important now.

> The point is, given that UTF-8 can express all of ISO 10646, I wonder if
> it'll be possible to block the non-Unicode planes should a large enough
> constituency emerge. The Internet and open source communities have a
> nasty habit of tearing up old agreements.

There just isn't any need for them. You can use any format you want,
but if it takes more than 4 bytes per character it's not UTF-8 any more.
And the chance that the number of characters will go past 1.1 million
(17 planes of 65,536 code points each) is nil.
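
For anyone who wants the arithmetic behind those two numbers, here is a
rough sketch in Python (my choice for illustration, nothing from the
standards themselves). The names PLANES, PLANE_SIZE and utf8_length are
made up for this example. It shows that planes 0 through 16 hold
1,114,112 code points and that the last of them, U+10FFFF, already fits
in 4 UTF-8 bytes.

```python
# Illustrative sketch only; all names here are hypothetical.
PLANES = 17            # planes 0 through 16
PLANE_SIZE = 0x10000   # 65,536 code points per plane

TOTAL_CODE_POINTS = PLANES * PLANE_SIZE   # the "1.1 million" ceiling
MAX_CODE_POINT = TOTAL_CODE_POINTS - 1    # U+10FFFF


def utf8_length(code_point: int) -> int:
    """Bytes UTF-8 needs for one code point (surrogates are not checked)."""
    if not 0 <= code_point <= MAX_CODE_POINT:
        raise ValueError("code point outside planes 0-16")
    if code_point < 0x80:
        return 1
    if code_point < 0x800:
        return 2
    if code_point < 0x10000:
        return 3
    return 4


if __name__ == "__main__":
    print(TOTAL_CODE_POINTS)             # 1114112
    print(hex(MAX_CODE_POINT))           # 0x10ffff
    print(utf8_length(MAX_CODE_POINT))   # 4
```

Anything that would need a fifth byte is, by this check, outside the
range altogether, so the function raises instead of returning 5.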

> The LINC and PDP-5/8/12 used SIXBIT extensively, especially for filenames
> (as did some PDP-10 operating systems but not Tenex or TOPS-20). SIXBIT
> was a 6-bit coded character set, using the corresponding ASCII glyphs from
> 0x20 - 0x5f for SIXBIT 0x00 - 0x3f.

That may have been SIXBIT-10, but on the PDP-8 the standard was to encode
U+0040 to U+005F as 0 to 037, and U+0020 to U+003F as 040 to 077.
That's what the SIXBIT assembler pseudo-op did, anyway.
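
In code the two conventions differ only in how the 6 bits are derived
from the ASCII value. A minimal sketch in Python, with function names I
have invented for the purpose, assuming input restricted to U+0020
through U+005F (the numbers above are octal):

```python
# Illustrative sketch only; function names are hypothetical.

def _check(ch: str) -> int:
    """Return the code of ch, rejecting anything outside U+0020..U+005F."""
    code = ord(ch)
    if not 0x20 <= code <= 0x5F:
        raise ValueError(f"{ch!r} has no SIXBIT equivalent")
    return code


def sixbit_pdp8(ch: str) -> int:
    """PDP-8 convention as described above: keep the low 6 bits.

    U+0040..U+005F -> 0..037 and U+0020..U+003F -> 040..077 (octal).
    """
    return _check(ch) & 0o77


def sixbit_offset(ch: str) -> int:
    """The quoted convention: 0x20..0x5F -> 0x00..0x3F, i.e. subtract 0x20."""
    return _check(ch) - 0x20


if __name__ == "__main__":
    for ch in "@A_ ?":
        print(repr(ch), oct(sixbit_pdp8(ch)), oct(sixbit_offset(ch)))
```

Over this range the two results always differ by exactly the 040 bit,
so converting between them is a single XOR with 040.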

> 0aaaaa 001aaaaa
> 1aaaaa 010aaaaa (if aaaaa is not 11111)
> 111111 0bbbbb 011bbbbb
> 111111 10bbbb 0000bbbb (if bbbb is not 1111)
> 111111 101111 01011111
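
Reading the quoted table as "6-bit units on the left, resulting 8-bit
ASCII code on the right", with 111111 acting as an escape prefix, a
decoder might look like the sketch below. That reading is my assumption,
since the columns are unlabeled, and the function name is invented.
Combinations the table does not list are rejected rather than guessed at.

```python
# Illustrative sketch only, under the stated reading of the table.
from typing import Iterable, List

ESCAPE = 0o77  # 111111


def decode_units(units: Iterable[int]) -> List[int]:
    """Turn a sequence of 6-bit units into 8-bit ASCII codes."""
    out: List[int] = []
    it = iter(units)
    for u in it:
        if not 0 <= u <= 0o77:
            raise ValueError("not a 6-bit unit")
        if u != ESCAPE:
            if u < 0o40:
                out.append(0x20 | u)            # 0aaaaa -> 001aaaaa
            else:
                out.append(0x40 | (u & 0o37))   # 1aaaaa -> 010aaaaa
            continue
        nxt = next(it, None)
        if nxt is None:
            raise ValueError("escape at end of input")
        if 0 <= nxt < 0o40:
            out.append(0x60 | nxt)              # 0bbbbb -> 011bbbbb
        elif 0o40 <= nxt < 0o57:
            out.append(nxt & 0o17)              # 10bbbb -> 0000bbbb
        elif nxt == 0o57:
            out.append(0x5F)                    # 101111 -> 01011111
        else:
            raise ValueError("sequence not listed in the table")
    return out


if __name__ == "__main__":
    # 'H', then escaped 'i', then escaped '_'
    print(bytes(decode_units([0o50, ESCAPE, 0o11, ESCAPE, 0o57])))
```

Under this reading the unescaped units cover U+0020 through U+005E, and
the escape reaches U+0060 through U+007F, U+0000 through U+000E, and
U+005F.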


Real FORTRAN programmers can program FORTRAN    John Cowan
in any language.  --Allen Brown       
