Re: Latin-1 compatible UTF-8 variant

From: Markus Kuhn (Markus.Kuhn@cl.cam.ac.uk)
Date: Sat Aug 28 1999 - 06:02:09 EDT


Dan wrote on 1999-08-28 08:54 UTC:
> - not thinking about reality when constructing UTF-8 and UTF-8
> readers/writers. It is simple to make a UTF-8 reader that
> accepts ISO 8859-1 or ISO 8859-1 mixed with UTF-8.

Please not again. The flame wars on this very subject can be looked up
in the archive.

> It is simple to make a compacter version of UTF-8 using the base
> 256 character codes were possible (comacter for many languages).

No. If you think otherwise, you have completely misunderstood what UTF-8
is all about. Please read the section "What is UTF-8?" in

  http://www.cl.cam.ac.uk/~mgk25/unicode.html

carefully; then you will see why a base-256 transfer encoding lacks the
essential properties that make UTF-8 so damn useful.
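[Editor's illustration, not part of the original post.] Two of those properties can be demonstrated in a few lines of Python: in UTF-8, every byte of a multi-byte sequence is >= 0x80, and continuation bytes occupy their own range, so readers can resynchronize. A "base 256" encoding that reuses all 256 byte values for U+0000..U+00FF necessarily gives both up:

```python
# A small sketch of two UTF-8 properties that a "base 256" encoding
# (reusing all 256 byte values for code points U+0000..U+00FF) gives up.
text = "Grüße/André"                 # Latin-1 text containing a '/'
utf8 = text.encode("utf-8")

# 1. Every byte of a multi-byte sequence is >= 0x80, so a byte-level
#    search for an ASCII character such as '/' finds only real slashes:
assert utf8.count(b"/") == text.count("/")

# 2. Self-synchronization: continuation bytes (0x80-0xBF) never look
#    like lead bytes, so a reader can find the next character boundary
#    even after losing bytes mid-stream:
def next_boundary(data: bytes, i: int) -> int:
    while i < len(data) and 0x80 <= data[i] <= 0xBF:
        i += 1                       # skip continuation bytes
    return i

assert next_boundary(utf8, 3) == 4   # byte 3 is inside the 'ü' sequence
```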

> If I today use most tools handling UTF-8 they will stupidly abort
> reading my files, because they are all in ISO 8859-1. And they will
> not write ISO 8859-1.

Abandon ISO 8859-1. Just switch over to UTF-8 by converting your entire
plain text file collection at once with something like

  find . -type f -exec recode latin1..utf-8 {} \;

Just as you did when you migrated from ISO 646-SE to ISO 8859-1.
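[Editor's illustration, not part of the original post.] For systems without recode, the same bulk conversion can be sketched in Python. Note the conversion direction can never fail: every one of the 256 byte values is a valid Latin-1 character. The function name and recursive-walk behaviour are my choices; as with the one-liner above, this rewrites files in place, so run it on a copy first:

```python
# Hedged Python equivalent of `find . -type f -exec recode latin1..utf-8 {} \;`
# Assumes every regular file under the tree really is Latin-1 plain text.
from pathlib import Path

def latin1_tree_to_utf8(root: str) -> int:
    """Re-encode every file under root from Latin-1 to UTF-8, in place.
    Returns the number of files converted."""
    converted = 0
    for p in Path(root).rglob("*"):
        if p.is_file():
            # decode("latin-1") cannot fail: all 256 byte values are defined
            text = p.read_bytes().decode("latin-1")
            p.write_bytes(text.encode("utf-8"))
            converted += 1
    return converted
```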

> When I write my software that handles ISO 10646, it will be able to
> read UTF-8, ISO 8859-1 and ISO 8859-1 with embedded UTF-8. And it
> will be able to write ISO 8859-1 with embedded UTF-8 allowing
> the data to work both with ISO 8859-1 only tools, and still
> being able to handle full ISO 10646.

"ISO 8859-1 with embedded UTF-8"? Yuck!

(Yuck! = Yet another Unfortunate Coding: Kill it!)

> - only thinking ASCII, when thinking about backward compatibility.

Ah, I see where your misunderstanding might originate:

The difference between ASCII backwards compatibility and Latin-1
backwards compatibility is the following:

Many ASCII characters act as control characters: they do not merely
represent a character, but trigger a function by their presence.
Examples are the '/' in POSIX filenames, the '\' and '%' in C printf
format strings, the '\0' string terminator in C, all the lovely
meanings that almost any ASCII character has in Perl, etc.

Therefore, we have to be very careful with ASCII backwards compatibility,
because ASCII is so rich in characters that carry control functions in
the real world. It simplifies the introduction of Unicode tremendously
if the multi-byte encoding does not introduce spurious new
control-character bytes as parts of the new multi-byte sequences,
bytes to which old string-processing applications would react
in unexpected ways.
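[Editor's illustration, not part of the original post.] This hazard is concrete in older multi-byte encodings that do reuse ASCII byte values inside sequences. The classic example is Shift-JIS, where the character '表' (U+8868) encodes with a 0x5C second byte, i.e. an ASCII backslash, notoriously confusing path handling and printf-style format strings on Japanese systems. UTF-8 avoids this by construction:

```python
# Spurious control byte in Shift-JIS vs. the UTF-8 guarantee.
sjis = "表".encode("shift_jis")
utf8 = "表".encode("utf-8")

assert b"\\" in sjis                  # a bare backslash byte hides inside
assert all(b >= 0x80 for b in utf8)   # no ASCII bytes at all in UTF-8
```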

Latin-1, in contrast, added de facto no new characters that are widely
used for control functions. (The soft hyphen would have been a potential
candidate, but it was never implemented widely enough to justify making
a fuss about it.)

Therefore, ASCII and Latin-1 backwards compatibility of UTF-8 are two
VERY VERY different issues.

A Latin-1 backwards-compatible encoding that features most of the
important properties of UTF-8 would have required allocating all
multi-byte sequence bytes in the C1 range (0x80-0x9F), which would have
given us much lower coding efficiency. Definitely not worth the trouble.
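[Editor's back-of-envelope arithmetic, not Kuhn's numbers.] Confining multi-byte sequences to the 32 byte values of the C1 range means at most 5 payload bits per byte, versus UTF-8's 6-bit continuation bytes plus a larger lead-byte payload. A rough comparison, assuming the hypothetical C1-only encoding carries a full 5 bits in every byte:

```python
# Coding-efficiency comparison: hypothetical C1-only multi-byte
# encoding (32 usable byte values = 5 bits/byte) vs. real UTF-8.
def c1_codepoints(n_bytes: int) -> int:
    return 32 ** n_bytes              # 5 payload bits per byte

def utf8_codepoints(n_bytes: int) -> int:
    # an n-byte UTF-8 sequence carries (7 - n) lead-byte bits
    # plus 6 bits in each of the (n - 1) continuation bytes
    return 2 ** ((7 - n_bytes) + 6 * (n_bytes - 1))

print(c1_codepoints(2), utf8_codepoints(2))   # 1024 vs 2048
print(c1_codepoints(3), utf8_codepoints(3))   # 32768 vs 65536
```

On these assumptions, the C1-only scheme addresses only half as many code points per sequence length, and would need five bytes where UTF-8 needs four.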

Markus

-- 
Markus G. Kuhn, Computer Laboratory, University of Cambridge, UK
Email: mkuhn at acm.org,  WWW: <http://www.cl.cam.ac.uk/~mgk25/>



This archive was generated by hypermail 2.1.2 : Tue Jul 10 2001 - 17:20:51 EDT