Re: Is there a UTF that allows ISO 8859-1?

From: Roman Czyborra (czyborra@cs.tu-berlin.de)
Date: Tue Sep 01 1998 - 05:17:32 EDT


> SCSU uses the same basic technique as Doug invented, although it
> selects in half-blocks of 128 instead of full blocks of 256. There
> are also locking shifts which allow you to lock in a half block
> (e.g. Hebrew) for single-byte encoding.

Unfortunately SCSU only allows you to slide your 8bit window to full
multiples of 128 like SD7 0x0580 (=1F=0B) for Hebrew. You cannot
slide to 0x0570 which would be necessary if you wanted the 8bit value
=E0 to mean U+05D0 HEBREW LETTER ALEF as in ISO-8859-8 Hebrew. SCSU
can compress all scripts covered by ISO-8859, but ISO-8859-1 Latin1 is
the only ISO-8859 charset that flows transparently through SCSU.
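
The window arithmetic can be checked with a few lines of Python. This
is only a sketch of the offset calculation, not an SCSU implementation;
the names window_char and required_offset are made up for illustration,
and it assumes the simple case where the offset byte (here =0B) is
multiplied by 128 and a window byte 0x80..0xFF maps to offset plus
(byte - 0x80).

```python
def window_char(offset, byte):
    """Map a byte in 0x80..0xFF through a window that starts at `offset`."""
    if not 0x80 <= byte <= 0xFF:
        raise ValueError("window bytes are 0x80..0xFF")
    return offset + (byte - 0x80)


def required_offset(byte, code_point):
    """Window start that would make `byte` decode to `code_point`."""
    return code_point - (byte - 0x80)


# Offset byte =0B selects a window at 0x0B * 0x80.
hebrew_window = 0x0B * 0x80
print(hex(hebrew_window))                        # 0x580

# With that window, ALEF (U+05D0) is reached by byte 0xD0, not 0xE0.
print(hex(window_char(hebrew_window, 0xD0)))     # 0x5d0

# ISO-8859-8 puts ALEF at 0xE0, which would need a window at 0x0570,
# and 0x0570 is not a multiple of 128.
needed = required_offset(0xE0, 0x05D0)
print(hex(needed), needed % 0x80 == 0)           # 0x570 False

# Latin1 works because a window at 0x0080 is the identity mapping.
print(hex(window_char(0x0080, 0xE9)))            # 0xe9
```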

If SCSU allowed you to slide to offsets like 0x0350 for ISO-8859-7
Greek, 0x03E0 for ISO-8859-5 Cyrillic, 0x0570 for ISO-8859-8 Hebrew,
0x05E0 for ISO-8859-6 Arabic, 0x08E0 for ISCII Devanagari, 0x0DE0 for
TIS-620 Thai, and 0xFF40 for JIS-X-0201 Katakana, you would be able to
present monolingual text in SCSU by simply inserting a window shift in
the beginning and using the Unicode escape SQU =0E for the occasional
rare character not covered by those traditional charsets so that the
text still shows up perfectly on Unicode-capable systems and remains
mostly readable on platforms using the older charsets.
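
The offsets listed above can be spot-checked against the codecs that
ship with Python. The sketch below tests just one sample letter per
charset, so it does not show that every position lines up (the
punctuation rows of some of these charsets do not). ISCII is left out
because I know of no ISCII codec in the Python standard library, and
shift_jis stands in for JIS-X-0201 because it decodes the single bytes
0xA1..0xDF as halfwidth Katakana. CASES and window_matches are
illustrative names.

```python
# (Python codec, hypothetical window offset, sample byte in that charset)
CASES = [
    ("iso8859_7", 0x0350, 0xC1),  # GREEK CAPITAL LETTER ALPHA
    ("iso8859_5", 0x03E0, 0xB0),  # CYRILLIC CAPITAL LETTER A
    ("iso8859_8", 0x0570, 0xE0),  # HEBREW LETTER ALEF
    ("iso8859_6", 0x05E0, 0xC1),  # ARABIC LETTER HAMZA
    ("tis_620",   0x0DE0, 0xA1),  # THAI CHARACTER KO KAI
    ("shift_jis", 0xFF40, 0xB1),  # HALFWIDTH KATAKANA LETTER A
]


def window_matches(codec, offset, byte):
    """True if the legacy byte and the slid window agree on the code point."""
    legacy_code_point = ord(bytes([byte]).decode(codec))
    return legacy_code_point == offset + (byte - 0x80)


results = {codec: window_matches(codec, offset, byte)
           for codec, offset, byte in CASES}

for codec, ok in results.items():
    print(codec, "ok" if ok else "MISMATCH")
```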

[1] http://czyborra.com/scsu/
[2] http://czyborra.com/charsets/iso8859.html

> Ein unbekannter Locale Name wurde Ã¼bergeben.
> Ein unbekannter Locale Name wurde übergeben.

> (Note that I was able to cut and paste the UTF-8 string right into
> the Latin-1 text editor that I am editing this mail in, without any
> loss of data or complaint from the operating system.)

That's because the ü (U+00FC LATIN SMALL LETTER U WITH DIAERESIS) you
chose in your example happens to belong to the lucky 50% of non-ASCII
characters that are expressed with safe values [\xA0-\xFF] in UTF-8.
I doubt that you would be as successful with the companion capital Ü
which is encoded as =C3=9C (Ã plus the C1 control byte =9C).
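
A short Python sketch makes the difference between the two letters
visible. It treats a character as "safe" when every byte of its UTF-8
form falls in [\xA0-\xFF], and it counts the two-byte range U+0080
through U+07FF to see where the 50% comes from. The helper names
utf8_is_latin1_safe and qp are invented for this example.

```python
def utf8_is_latin1_safe(ch):
    """True if every UTF-8 byte of `ch` lies in 0xA0..0xFF."""
    return all(0xA0 <= b <= 0xFF for b in ch.encode("utf-8"))


def qp(ch):
    """Show the UTF-8 bytes of `ch` in =XX notation."""
    return "".join("=%02X" % b for b in ch.encode("utf-8"))


print(qp("\u00fc"), utf8_is_latin1_safe("\u00fc"))   # =C3=BC True
print(qp("\u00dc"), utf8_is_latin1_safe("\u00dc"))   # =C3=9C False

# Every character from U+0080 to U+07FF takes two bytes in UTF-8.
# The lead byte is always >= 0xC2; the trail byte is 0x80..0xBF,
# and only its upper half 0xA0..0xBF is safe.
two_byte = range(0x80, 0x800)
safe = sum(utf8_is_latin1_safe(chr(cp)) for cp in two_byte)
print(safe, len(two_byte))                           # 960 1920
```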


