RE: Is there a UTF that allows ISO 8859-1 (latin-1)?

From: Murray Sargent (murrays@microsoft.com)
Date: Wed Aug 12 1998 - 00:04:29 EDT


My two cents' worth is that there are already too many Unicode transformation
formats. Every time another is added, it remains unreadable by some software
and other software has to be revised to handle it. Such revisions can
introduce bugs, incompatibilities, etc. To keep things simple in the
plain-text world, a best bet is to use UTF-8, which everyone has to support,
which can be converted in blocks (unlike UTF-7), and which is well
documented. Such a single interface may be a bit inconvenient in various
cases, but you can rely on it since it ends up being so well tested.
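
To illustrate the "converted in blocks" point, here is a minimal Python
sketch. It assumes the property being relied on is that UTF-8 continuation
bytes always have the form 10xxxxxx, so a buffer can be cut at a character
boundary by looking at only a few bytes, and each block can then be decoded
on its own. The function name split_at_char_boundary is made up for this
example.

```python
def split_at_char_boundary(data: bytes, limit: int) -> int:
    """Return the largest cut <= limit that does not split a UTF-8 sequence."""
    cut = min(limit, len(data))
    # If the byte at the cut is a continuation byte (10xxxxxx), the cut falls
    # inside a multi-byte character, so back up to that character's lead byte.
    while 0 < cut < len(data) and (data[cut] & 0xC0) == 0x80:
        cut -= 1
    return cut


data = "a\u00e9\u20ac".encode("utf-8")    # 1-byte, 2-byte and 3-byte characters
cut = split_at_char_boundary(data, 5)     # a cut at 5 would split the last one
first, rest = data[:cut].decode("utf-8"), data[cut:].decode("utf-8")
```

UTF-7, by contrast, carries shift state across its base64 runs, so a block
cannot in general be decoded without knowing what came before it.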

Nevertheless, the major software companies do support many other formats, so
yet another Tower of Babel has to be dealt with. Sigh.

One thought for your particular problem is to use 8859-1 if all the
characters in a file belong to 8859-1 and to use UTF-8 if one or more don't.
Just be sure to remember which files are encoded using which codepage.
Putting a UTF-8 BOM (0xEF 0xBB 0xBF) at the beginning of a UTF-8 file is a
good way to identify the file as UTF-8. But it's not so clear how to ID an
8859-1 file unless it's embedded in a higher protocol.
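
A rough Python sketch of that approach, in case it helps: try 8859-1 first
and fall back to UTF-8 with a BOM only when some character doesn't fit. The
function name encode_text is invented for this example; codecs.BOM_UTF8 is
the standard-library constant for the bytes 0xEF 0xBB 0xBF.

```python
import codecs
from typing import Tuple


def encode_text(text: str) -> Tuple[bytes, str]:
    """Return (encoded bytes, name of the encoding that was used)."""
    try:
        # Every character is in ISO 8859-1: one byte each, no marker.
        return text.encode("iso-8859-1"), "iso-8859-1"
    except UnicodeEncodeError:
        # At least one character is outside 8859-1: use UTF-8 and
        # prefix the BOM so the file identifies itself.
        return codecs.BOM_UTF8 + text.encode("utf-8"), "utf-8"
```

The returned encoding name is what you would record if you keep track of
which file uses which codepage.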

On a private basis, you could have the convention that if it doesn't have
the UTF-8 BOM, then it's 8859-1. If you want to use your UTF-8 files with
software that only handles 8859-1, pass them through a converter (saving the
original) that replaces the non-8859-1 characters with question marks.
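
As a sketch only, the private convention and the converter could look like
this in Python. The function names are hypothetical, and the code assumes a
file with the BOM is valid UTF-8 after it.

```python
import codecs


def decode_by_convention(data: bytes) -> str:
    """Private convention: UTF-8 BOM means UTF-8, anything else is 8859-1."""
    if data.startswith(codecs.BOM_UTF8):
        return data[len(codecs.BOM_UTF8):].decode("utf-8")
    return data.decode("iso-8859-1")


def to_latin1_lossy(data: bytes) -> bytes:
    """Convert to 8859-1, replacing characters it cannot represent with '?'."""
    text = decode_by_convention(data)
    # errors="replace" makes the encoder emit '?' for unencodable characters.
    return text.encode("iso-8859-1", errors="replace")
```

The conversion is lossy, which is why the original file should be kept.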

Meanwhile, there's lots of software now that supports Unicode, e.g., Word 97
and Excel 97, so using Unicode might be more convenient than you first
thought.

Thanks
Murray

> -----Original Message-----
> From: Gunther Schadow [SMTP:schadow@aurora.rg.iupui.edu]
> Sent: Tuesday, August 11, 1998 6:04 PM
> To: Unicode List
> Subject: Is there a UTF that allows ISO 8859-1 (latin-1)?
>
> Hi,
>
> I recently discovered Unicode and I must say that it is great! I found
> out that the lower 8 bits of the Unicode are backwards compatible to
> ISO 8859-1 (Latin-1). Thus, if the high byte is zero, we would not
> really have to transmit it in messages. UTF-8 and UTF-7 do the trick
> for the old 7-bit ASCII set but require me to render Latin-1 codes
> that have the high bit set unreadable by non-Unicode aware presentation
> programs. Also UTF-8 and UTF-7 require me to change all my ISO Latin-1
> texts to UTF. This is not satisfactory for a European who has produced
> lots of text in Latin-1 and who depends on Latin-1 aware but UTF-7/8
> unaware software. I wonder if there is no encoding like UTF-7 that would
> allow all of the lower eight bits to be set.
>
> I see this isn't possible with UTF-8, because the presence of the high
> bit encodes the escape to the multi-byte character code. But in UTF-7
> this would have been perfectly possible, because we use a full escape
> character rather than the high bit.
>
> I would like to know (1) if others feel the same concerns that there
> is one UTF missing, (2) if there are proposals out already, and (3)
> if such a proposal (much like UTF-7) would have a chance to be accepted
> by whoever is in charge of the UTF series (Unicode org? ISO?).
>
> regards
> -Gunther Schadow
>
> PS: how can I subscribe to the unicode mailing list?
>
> Gunther Schadow -----------------------------------
> http://aurora.rg.iupui.edu
> Regenstrief Institute for Health Care
> 1001 W 10th Street RG5, Indianapolis IN 46202, Phone: (317) 630 7960
> schadow@aurora.rg.iupui.edu ---------------------- #include
> <usual/disclaimer>


