> I divided the 16-bit UCS-2 codespace into 256 blocks of 256
> characters each. At any time, exactly one block was "selected in,"
> and the text consisted of characters from that block of 256 Unicode
> characters. The default block was 0x00 but could be changed by
> using an escape character followed by a byte representing the block
> to be selected. I think I chose 0x81 for the escape character,
> since it was unused in Windows CP1252 and the C1 meaning, HIGH OCTET
> PRESET, sounded close to what I was doing with it.
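For readers who want to see the mechanics, here is a minimal sketch of the scheme Doug describes. The function names `encode_blocks` and `decode_blocks` are hypothetical, and the description does not say how a character whose low byte is itself 0x81 would be written, so this sketch simply refuses such characters. It is an illustration under those assumptions, not Doug's actual code.

```python
ESCAPE = 0x81  # escape byte Doug recalls choosing


def encode_blocks(text: str) -> bytes:
    """Encode UCS-2 text as single bytes plus block-select escapes."""
    out = bytearray()
    block = 0x00  # default block, per the description
    for ch in text:
        cp = ord(ch)
        if cp > 0xFFFF:
            raise ValueError("code point outside the 16-bit UCS-2 codespace")
        high, low = cp >> 8, cp & 0xFF
        if low == ESCAPE:
            # Assumption: the description leaves this case open,
            # so the sketch refuses it rather than guessing a rule.
            raise ValueError("low byte collides with the escape byte")
        if high != block:
            out += bytes([ESCAPE, high])  # select in the new block
            block = high
        out.append(low)
    return bytes(out)


def decode_blocks(data: bytes) -> str:
    """Reverse encode_blocks, tracking the currently selected block."""
    chars = []
    block = 0x00
    i = 0
    while i < len(data):
        b = data[i]
        if b == ESCAPE:
            if i + 1 >= len(data):
                raise ValueError("escape byte at end of data")
            block = data[i + 1]  # the byte after the escape names the block
            i += 2
            continue
        chars.append(chr((block << 8) | b))
        i += 1
    return "".join(chars)
```

With these definitions, `encode_blocks("a\u05d0")` yields the bytes 61 81 05 D0: one plain byte from block 0x00, then the escape, the block number 0x05, and the low byte of U+05D0.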
and Gunther replied:
> Your "stateful Unicode transformation" seems just great to me,
> I was near to give up on a better UTF, but now that I heard about
> your encoding and added to it the code block overlay technique, I want
> to even more work out something that's better than UTF-8.
Gunther, I would suggest that before you go too far down the
crusade path, you familiarize yourself with Unicode Technical
Report #6, "Standard Compression Scheme for Unicode." SCSU uses
the same basic technique as Doug invented, although it selects in
half-blocks of 128 instead of full blocks of 256. Doug's use of
0x81 is mirrored in SCSU by the use of the SQU tag (used to quote
a Unicode character), although the value used for the tag is 0x0E
instead of 0x81. There are also locking shifts which allow you to either
lock in a half block (e.g. Hebrew) for single-byte encoding, or
to shift to Unicode mode for a stretch to be quoted in two-byte
form. And the initial active "window" in SCSU is Window 0, which covers
the Latin-1 Supplement (i.e. U+0080..U+00FF), so that a straight
stream of Latin-1 data that avoids the control code values 0x01..0x08,
0x0B..0x0C, 0x0E..0x1F is *already* in SCSU compressed form.
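The check implied here is easy to sketch. The helper name `latin1_is_already_scsu` is hypothetical, and the set of tag bytes is taken directly from the ranges cited above rather than from any SCSU library.

```python
# Byte values that act as SCSU tags in single-byte mode, per the ranges above:
# 0x01..0x08, 0x0B..0x0C, 0x0E..0x1F.
SCSU_TAG_BYTES = (
    frozenset(range(0x01, 0x09))
    | frozenset({0x0B, 0x0C})
    | frozenset(range(0x0E, 0x20))
)


def latin1_is_already_scsu(data: bytes) -> bool:
    """True if this Latin-1 byte stream contains none of the SCSU tag values,
    so an SCSU decoder in its initial state would read it back unchanged."""
    return not any(b in SCSU_TAG_BYTES for b in data)
```

Ordinary text with tabs, carriage returns, and line feeds passes this check; a stream containing, say, an ESC (0x1B) control sequence does not.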
Implementations of an SCSU encoder may choose to ignore the use
of multiple dynamic windows and to use simpler heuristics for
when to engage single quotes and when to engage locking shifts. This would
result in less compression, but simpler output, more along the
lines of what you are envisioning as "UTF-sane". Latin-1 data
(except for the above-cited control values) can stay Latin-1 bytes,
and you can deal with all of the rest of Unicode in the same
Of course, you're up against the same problem that Doug pointed out--
that this is only as useful as it is widespread in actual software.
To my knowledge, the only implementations of SCSU (or its predecessor,
RCSU) are intended precisely for *compression* of Unicode data,
and not as an otherwise general text interchange format. It is
quite unlikely that you are going to find terminals, browsers, or
other general text visualization software accepting SCSU as a
regular format for Unicode data, nor is it likely you will find
SCSU data passed around as a text encoding in Internet protocols
any time soon.
> But I am not talking about my
> personal software and hobby, I am talking about business empires where
> programmers are hidden behind meter-thick walls from the public
> guarded by legions of sales-people who don't make a big difference
> between a vacuum-cleaner and a computer program.
Many of the Unicode implementers on this list work for these
"business empires", and we aren't exactly hiding from the public.
The consensus seems to be that UTF-8 is here to stay, that it is
standard and becoming widespread in implementation, and that it
isn't really that big a problem that it doesn't preserve Latin-1
byte values so as to be readily legible on Latin-1 terminals.
As others have pointed out, UTF-8 is not that hard to auto-detect,
and some rather simple extensions to existing software could make
Ein unbekannter Locale Name wurde Ã¼bergeben.
display as
Ein unbekannter Locale Name wurde übergeben.
in a Latin-1 terminal.
(Note that I was able to cut and paste the UTF-8 string right into
the Latin-1 text editor that I am editing this mail in, without any
loss of data or complaint from the operating system.)
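A rough sketch of the kind of auto-detection meant here: try strict UTF-8 first and fall back to Latin-1 only if the bytes are not well-formed UTF-8. The function name `decode_for_display` is hypothetical, and the heuristic is an assumption about what such an extension might do, not a description of any existing terminal.

```python
def decode_for_display(raw: bytes) -> str:
    """Guess between UTF-8 and Latin-1 for a byte string about to be shown."""
    try:
        # Strict decoding rejects most Latin-1 text containing bytes >= 0x80,
        # because those bytes rarely form valid UTF-8 sequences by accident.
        return raw.decode("utf-8")
    except UnicodeDecodeError:
        # Latin-1 maps every byte value to a character, so this cannot fail.
        return raw.decode("latin-1")


utf8_bytes = "übergeben".encode("utf-8")      # ü becomes the two bytes C3 BC
latin1_bytes = "übergeben".encode("latin-1")  # ü becomes the single byte FC

print(utf8_bytes.decode("latin-1"))      # what a plain Latin-1 terminal shows
print(decode_for_display(utf8_bytes))    # with detection
print(decode_for_display(latin1_bytes))  # Latin-1 input still displays correctly
```

Pure ASCII decodes identically either way, so the guess only matters for bytes at or above 0x80.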