Re: Is there a UTF that allows ISO 8859-1 (latin-1)?

From: Dan Oscarsson (Dan.Oscarsson@trab.se)
Date: Tue Aug 18 1998 - 08:01:26 EDT


>We had this Latin-1-compatible-UTF discussion several times before, it
>is certainly in no way a new idea (check the news:comp.std.internat
>archives):
>
> - You will always need new software, no matter whether you use UTF-8 or
> any fancy encoding in which ISO 8859-1 files do not have to be recoded.
> So don't overestimate the practical advantages of the illusion of
> backwards compatibility. The advantage of backwards compatibility with
> ASCII in UTF-8 is only important because a number of ASCII characters
> such as NUL and SOLIDUS have special functions in software that is
> otherwise completely ignorant of the character set. No Latin-1
> character has such special semantics in any software I am aware
> of (I have yet to see a SHY implementation that can't be deactivated
> easily).
Well, I have programs that use ISO 8859-1 characters for special functions.
But the most important advantage of backwards compatibility is that
old programs that do not understand UTF-8 can still work while UTF-8
programs also work - on the same data!
A situation where some text files are in ISO 8859-1 and some are in UTF-8
is impossible to manage. It will be a hopeless mess. And many UTF-8 programs
die or stop reading a file when it is in ISO 8859-1.
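
As an illustration of that last point (a sketch in Python, not part of the
original discussion; the helper name read_strict_utf8 is hypothetical), a
strict UTF-8 reader rejects an ISO 8859-1 file as soon as it meets a byte
in the 0x80-0xff range that does not form a valid UTF-8 sequence:

```python
# Hypothetical helper standing in for a "pure UTF-8" program.
def read_strict_utf8(data: bytes) -> str:
    # Python's default error mode is "strict": invalid input raises.
    return data.decode("utf-8")


# "Malmö" stored as ISO 8859-1: the ö is the single byte 0xF6.
latin1_bytes = "Malmö".encode("iso-8859-1")

try:
    read_strict_utf8(latin1_bytes)
    strict_ok = True
except UnicodeDecodeError:
    # 0xF6 cannot start a UTF-8 sequence, so the reader gives up here.
    strict_ok = False
```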

>
> - UTF-8 has a large number of very neat properties that are not possible
> to get with any of the proposals for a Latin-1 compatible encoding,
> especially the combination of self-synchronization, the compactness
> (only up to 3 characters length) and the preservation of the UCS-4
> lexical string order (important for things such as B-trees in DBMSs).
For several languages, the adaptive encoding of UTF-8 that I suggested would be
more compact than pure UTF-8. As far as I can see, self-synchronization is
also retained. It may be that the lexical string order is not fully retained, but
any program needing that can hold the text as pure UTF-8 (or UCS)
internally. The important thing is that all normal text in files and other
simple storage (that is, not databases) should be ISO 8859-1 compatible.
Data storage that needs special software to access the storage device
(like databases) can use any encoding it likes internally, since it is always
accessed through that special software. A normal file can be read and
written by many tools and must therefore be in a standard format that most
programs can handle.
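
Both claims can be checked with a few lines of Python (a sketch only; the
sample strings and variable names are chosen here for illustration). A
Latin-1 byte saves one byte per non-ASCII character compared with UTF-8,
while mixing Latin-1 bytes with UTF-8 sequences can break the byte-wise
ordering that pure UTF-8 guarantees:

```python
# Compactness: each character in U+0080..U+00FF takes one byte in
# ISO 8859-1 but two bytes in UTF-8.
swedish = "Räksmörgås på Östermalm"
latin1_size = len(swedish.encode("iso-8859-1"))
utf8_size = len(swedish.encode("utf-8"))
non_ascii = sum(1 for ch in swedish if ord(ch) > 0x7F)

# Ordering: U+00E9 (é) sorts before U+0100 (Ā) by code point.
code_point_order = "é" < "Ā"

# Pure UTF-8 preserves that order when comparing raw bytes ...
utf8_order = "é".encode("utf-8") < "Ā".encode("utf-8")

# ... but é as a raw Latin-1 byte (0xE9) sorts after the UTF-8
# lead byte of Ā (0xC4), so the order is reversed.
mixed_order = "é".encode("iso-8859-1") < "Ā".encode("utf-8")
```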

>
>If you really need a Latin-1 compatible UTF, then just use UTF-7 but do
>not transform the characters in the 0x80-0xff range. This is a
>straightforward modification of UTF-7 and it costs you just one or two
>bytes to change in a UTF-7 implementation. This technique is so obvious
>and trivial that it is not even worth writing a formal specification for
>it.
>
>I hope it will not become popular. Another UCS encoding is certainly not
>what the world has been waiting for.
I agree, UTF-7 could be possible but is not wanted. My adaptive UTF-8 is
really UTF-8, except that the software accepts byte sequences that are not
valid UTF-8 when reading (treating them as ISO 8859-1), and uses ISO 8859-1,
if possible, when writing. It could easily be incorporated into existing
UTF-8 software.
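
A minimal sketch of such a reader and writer, in Python. This is one
possible reading of the idea, not a specification: the function names
adaptive_decode and adaptive_encode are hypothetical, and the round-trip
check in the writer is an assumption added here so that Latin-1 text which
happens to look like valid UTF-8 is not misread later.

```python
def adaptive_decode(data: bytes) -> str:
    """Decode valid UTF-8 sequences; treat any other byte as ISO 8859-1."""
    out = []
    i = 0
    while i < len(data):
        b = data[i]
        if b < 0x80:                      # plain ASCII
            out.append(chr(b))
            i += 1
            continue
        # Expected sequence length from the lead byte (0 = not a lead byte).
        if 0xC2 <= b <= 0xDF:
            length = 2
        elif 0xE0 <= b <= 0xEF:
            length = 3
        elif 0xF0 <= b <= 0xF4:
            length = 4
        else:
            length = 0
        chunk = data[i:i + length]
        if length and len(chunk) == length:
            try:
                out.append(chunk.decode("utf-8"))  # strict validation
                i += length
                continue
            except UnicodeDecodeError:
                pass
        out.append(chr(b))                # fallback: byte value == code point
        i += 1
    return "".join(out)


def adaptive_encode(text: str) -> bytes:
    """Write ISO 8859-1 when that is possible and unambiguous, else UTF-8."""
    try:
        candidate = text.encode("iso-8859-1")
    except UnicodeEncodeError:
        return text.encode("utf-8")
    # Assumption: only keep Latin-1 if the adaptive reader gets it back.
    if adaptive_decode(candidate) == text:
        return candidate
    return text.encode("utf-8")
```

With these definitions, adaptive_decode accepts both b"Malm\xf6" and the
UTF-8 form of the same word, and adaptive_encode falls back to UTF-8 for
text containing characters outside ISO 8859-1.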

   Dan

--
Dan Oscarsson
Telia Prosoft AB                       Email: Dan.Oscarsson@trab.se
Box 85
201 20  Malmo, Sweden


