8-bit text which is supposed to be UTF-8 but isn't

From: Erland Sommarskog (sommar@algonet.se)
Date: Sat Jan 29 2000 - 16:14:37 EST


It is my understanding that there are many 8-bit sequences which are
not legal UTF-8, and that you can easily detect this. For instance, the
string "räksmörgås" would quickly be revealed as something other than
UTF-8. Possibly you can extrapolate from the general rules, and arrive
at something which normally should be coded some other way, but it is
my recollection from previous discussions on this list that this is
frowned upon, and that an agent which discovers this should barf, or at
least replace the illegal characters with block characters or similar.
Am I right?
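
As a rough illustration of the detection described above, here is a
minimal Python sketch. It assumes the example string arrived as Latin-1
bytes; the function name check_utf8 is made up for this example, and
using U+FFFD as the "block character" is only one possible choice.

```python
from typing import Tuple

def check_utf8(raw: bytes) -> Tuple[bool, str]:
    """Return (is_valid_utf8, text_for_display).

    Invalid input is not reinterpreted in another charset; each
    offending byte is shown as U+FFFD (the replacement character).
    """
    try:
        return True, raw.decode("utf-8")
    except UnicodeDecodeError:
        return False, raw.decode("utf-8", errors="replace")

# "raksmorgas" with a-umlaut, o-umlaut and a-ring, as raw Latin-1 bytes
# (assumed encoding of the example in the message).
latin1_bytes = b"r\xe4ksm\xf6rg\xe5s"
# The same word encoded as UTF-8.
utf8_bytes = b"r\xc3\xa4ksm\xc3\xb6rg\xc3\xa5s"

if __name__ == "__main__":
    print(check_utf8(latin1_bytes))
    print(check_utf8(utf8_bytes))
```

An agent that prefers to "barf" would simply let the decoding error
propagate instead of falling back to the replacement character.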

The reason I ask is that I'm on the Usefor list, where we are trying to
produce a new RFC for Usenet news. We're aiming to say that text
in article headers is UTF-8. But as many newsreaders today produce
raw 8-bit in Latin-1 or some other charset, we need to take this into
account to some extent. (Basically, we say: "just pass it on, and what
happens at presentation is undefined." If the agent settles on
presenting it as, say, Latin-1, that is fine with me.)

This is the BNF which is the draft right now:

    UTF8-xtra-head = %d192-255
    UTF8-xtra-tail = %d128-191
    UTF8-xtra-char = UTF8-xtra-head 1*UTF8-xtra-tail

Is this complete enough? (For the full context see
http://www.landfield.com/usefor/drafts/section_2.02.02, towards the
end.)
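
One way to probe the question is to compare the three draft rules with
a strict UTF-8 decoder. The Python sketch below is only an experiment
under stated assumptions: the helper names are invented, ASCII bytes
are assumed to be covered by other rules of the draft, and Python's
built-in decoder stands in for "legal UTF-8".

```python
import re

# One or more UTF8-xtra-char: a head byte (192-255) followed by
# one or more tail bytes (128-191), as in the draft BNF.
XTRA_RUN = re.compile(rb"(?:[\xC0-\xFF][\x80-\xBF]+)+")
# Any maximal run of bytes with the high bit set.
HIGH_RUN = re.compile(rb"[\x80-\xFF]+")

def matches_draft_bnf(raw: bytes) -> bool:
    """True if every run of non-ASCII bytes consists of UTF8-xtra-char."""
    return all(XTRA_RUN.fullmatch(run) for run in HIGH_RUN.findall(raw))

def is_strict_utf8(raw: bytes) -> bool:
    """True if Python's UTF-8 decoder accepts the bytes."""
    try:
        raw.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False

samples = [
    b"r\xc3\xa4k",      # UTF-8 a-umlaut
    b"r\xe4k",          # Latin-1 a-umlaut: head byte with no tail
    b"\xc0\xaf",        # overlong encoding of "/"
    b"\xc3\xa4\xa4",    # two-byte head followed by one tail too many
]

if __name__ == "__main__":
    for s in samples:
        print(s, matches_draft_bnf(s), is_strict_utf8(s))
```

In this sketch the draft rules reject the raw Latin-1 sample but accept
the last two samples, which a strict decoder refuses. That suggests the
BNF describes the general shape of UTF-8 rather than exactly the legal
sequences.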

Generally, we have been careful not to get into too much detail on how
UTF-8 will work when it comes to case equivalence and such, as there
is no one on the list with deeper knowledge in the area. We're just
assuming that there will be good libraries that programmers can use.

--
Erland Sommarskog, Stockholm, sommar@algonet.se


