Re: 8-bit text which is supposed to be UTF-8 but isn't

From: John Cowan
Date: Sat Jan 29 2000 - 13:49:57 EST

Erland Sommarskog scripsit:

> [A]n agent which discovers this should barf, or at least
> replace the illegal characters with block characters or similar. Am I
> right?

Yes, in general, but circumstances alter cases. Plan 9, for example,
maps ill-formed UTF-8 byte sequences to the defined-but-unused
character U+0080.
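
A minimal sketch of that substitution approach in Python, assuming
U+0080 as the substitute character. The handler name
"substitute_u0080" and the function decode_lenient are made up for
illustration, and how many substitutes a run of bad bytes produces
follows Python's decoder, which may differ from other implementations.

```python
import codecs

SUBSTITUTE = "\u0080"  # assumption: the substitute character named above


def _substitute_handler(err):
    """Replace each ill-formed span reported by the decoder with SUBSTITUTE."""
    if isinstance(err, UnicodeDecodeError):
        # Resume decoding right after the offending bytes.
        return (SUBSTITUTE, err.end)
    raise err


# Hypothetical handler name; any unused name would do.
codecs.register_error("substitute_u0080", _substitute_handler)


def decode_lenient(data: bytes) -> str:
    """Decode UTF-8, mapping ill-formed sequences to SUBSTITUTE instead of failing."""
    return data.decode("utf-8", errors="substitute_u0080")
```

An agent that prefers to barf can simply call data.decode("utf-8")
and let the UnicodeDecodeError propagate.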

> (Basically, we say: "just pass it on, and what
> happens at presentation is undefined." If the agent settles on
> presenting it as, say, Latin-1, that is fine with me.)

Reasonable for Usenet news, where most content is read by humans.

> This is the BNF which is the draft right now:
> UTF8-xtra-head = %d192-255
> UTF8-xtra-tail = %d128-191
> UTF8-xtra-char = UTF8-xtra-head 1*UTF8-xtra-tail

Here's a more precise version:

        UTF8-xtra-head-2 = %d192-223
        UTF8-xtra-head-3 = %d224-239
        UTF8-xtra-head-4 = %d240-247
        UTF8-xtra-tail = %d128-191
        UTF8-xtra-char = UTF8-xtra-head-2 UTF8-xtra-tail
                          / UTF8-xtra-head-3 2UTF8-xtra-tail
                          / UTF8-xtra-head-4 3UTF8-xtra-tail

Bytes %d248-253 are technically legal but will never be needed,
as Unicode/ISO 10646 will never grow beyond hex 0010FFFF except for
deprecated additional private-use zones that predate Unicode,
and bytes %d254-255 are outright illegal.
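
As a rough illustration, the grammar above can be checked with a few
lines of Python. This is only a sketch of the structural rule (head
byte followed by exactly one, two, or three tail bytes); it does not
reject overlong forms, surrogates, or values above hex 0010FFFF, and
the function names are invented for the example.

```python
from typing import Optional


def expected_tail_count(head: int) -> Optional[int]:
    """Return how many tail bytes must follow this head byte, or None if it is not a head."""
    if 192 <= head <= 223:
        return 1  # two-byte sequence
    if 224 <= head <= 239:
        return 2  # three-byte sequence
    if 240 <= head <= 247:
        return 3  # four-byte sequence
    return None  # ASCII, a tail byte, or a byte outside the grammar


def match_xtra_char(data: bytes, pos: int = 0) -> Optional[int]:
    """Return the length of the UTF8-xtra-char starting at pos, or None if none matches."""
    if pos >= len(data):
        return None
    tails = expected_tail_count(data[pos])
    if tails is None:
        return None
    tail = data[pos + 1 : pos + 1 + tails]
    # Every tail byte must exist and lie in %d128-191.
    if len(tail) != tails or any(not 128 <= b <= 191 for b in tail):
        return None
    return 1 + tails
```

For example, match_xtra_char(b"\xc3\xa9") matches a two-byte
sequence, while a lone b"\xc3" or a head byte followed by an ASCII
byte does not match.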

> Generally, we have been careful not to get into too much detail on how
> UTF-8 will work when it comes to case equivalence and such, as there
> is no one on the list with deeper knowledge in the area. We're just
> assuming that there will be good libraries that programmers can use.

IMHO (and other i18n types will probably agree), case folding is a bad
idea in general. For backward compatibility, only case-fold the
ASCII characters, and leave the others alone.
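
An ASCII-only fold of this kind might look like the following Python
sketch. The function names are hypothetical; the point is only that
A-Z map to a-z and every other character passes through unchanged.

```python
import string

# Translation table covering only the 26 ASCII capital letters.
_ASCII_FOLD = str.maketrans(string.ascii_uppercase, string.ascii_lowercase)


def ascii_casefold(text: str) -> str:
    """Lowercase A-Z only; non-ASCII characters are left alone."""
    return text.translate(_ASCII_FOLD)


def ascii_case_equal(a: str, b: str) -> bool:
    """Compare two strings, ignoring case for ASCII letters only."""
    return ascii_casefold(a) == ascii_casefold(b)
```

With this, "Subject" and "SUBJECT" compare equal, but a capital
E-acute and a small e-acute remain distinct.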

John Cowan                         
       I am a member of a civilization. --David Brin
