Re: utf-8 != latin-1

From: Mark Davis (mark@macchiato.com)
Date: Tue Oct 17 2000 - 09:24:26 EDT


One of the main features of XML is that it has quite strict rules about how
to handle errors. The goal, I believe, is to ensure that we are not awash in
malformed files that have no clear interpretation.

And this is clearly an error. The acceptable code points are explicitly
stated:

http://www.w3.org/TR/2000/REC-xml-20001006#dt-character

Converting an illegal UTF-8 sequence into a well-formed -- BUT WRONG --
sequence of code points is clearly against the intent of this production rule.
XML could have taken the opposite tack -- that illegal code points and
illegal code unit sequences are to be ignored. But it didn't.
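
As a rough illustration of the strict behavior argued for above, here is a
minimal sketch in Python (chosen only for illustration; nothing in this
thread uses it). The helper name decode_strict is hypothetical. It relies on
the standard bytes.decode method, which rejects ill-formed UTF-8 by default,
and contrasts that with reading the same bytes as Latin-1.

```python
# The four bytes discussed in this thread. 0xAD has the bit pattern
# 10101101, i.e. a UTF-8 continuation byte with no lead byte before it.
data = bytes([0xAD, 0x63, 0x61, 0x73])


def decode_strict(raw: bytes):
    """Return the decoded text, or None if raw is not well-formed UTF-8.

    Hypothetical helper for illustration; a real XML processor would
    report a fatal error instead of returning None.
    """
    try:
        return raw.decode("utf-8")  # errors="strict" is the default
    except UnicodeDecodeError:
        return None


strict_result = decode_strict(data)      # rejected as UTF-8
latin1_result = data.decode("latin-1")   # every byte maps to one code point

print(strict_result)
print([f"U+{ord(c):04X}" for c in latin1_result])
```

Under these assumptions the strict decode yields nothing at all, while the
Latin-1 reading yields a soft hyphen (U+00AD) followed by "cas", which is
presumably what the sender meant.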

Mark

BTW, I have a simple browser-based UTF converter (in Javascript) at
http://www.macchiato.com/unicode/charts.html (click on Converter). It lets
you convert back and forth to different UTFs, with various choices for
format. And it checks for illegal UTF-8 sequences!

----- Original Message -----
From: "Doug Ewell" <dewell@compuserve.com>
To: "Unicode List" <unicode@unicode.org>
Sent: Friday, October 13, 2000 21:59
Subject: Re: utf-8 != latin-1

"Steven R. Loomis" <srl@jtcsv.com> wrote:

> What happened was that the sequence AD 63 61 73 was
> interpreted as U+E54E U+DC73..

Why? As an illegal UTF-8 sequence, it shouldn't be interpreted as
anything.

John Cowan's "utf" perl script (which carries the appropriate
disclaimers about no error checking) converts that sequence to U+D94E
U+DC73, which seems a bit more reasonable -- at least it's a complete
surrogate pair.
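
The difference between the two outputs can be checked with a short sketch
in Python (used here only for illustration; the function names are
hypothetical). It tests whether two 16-bit values form a UTF-16 surrogate
pair and, if so, combines them using the standard UTF-16 formula. It says
nothing about how either tool arrived at its output, and neither result is a
correct reading of the original bytes.

```python
def is_high_surrogate(cp: int) -> bool:
    # High (lead) surrogates occupy U+D800..U+DBFF.
    return 0xD800 <= cp <= 0xDBFF


def is_low_surrogate(cp: int) -> bool:
    # Low (trail) surrogates occupy U+DC00..U+DFFF.
    return 0xDC00 <= cp <= 0xDFFF


def combine_surrogates(high: int, low: int):
    """Return the supplementary code point for a valid pair, else None."""
    if not (is_high_surrogate(high) and is_low_surrogate(low)):
        return None
    return 0x10000 + ((high - 0xD800) << 10) + (low - 0xDC00)


# U+E54E is a private-use code point, so U+DC73 is left as a lone
# low surrogate and there is no pair to combine.
first = combine_surrogates(0xE54E, 0xDC73)

# U+D94E is a high surrogate, so this is a complete pair.
second = combine_surrogates(0xD94E, 0xDC73)

print(first)
print(None if second is None else f"U+{second:X}")
```

The second pair is at least well-formed UTF-16, which is all that "a bit
more reasonable" claims here.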

-Doug Ewell
 Fullerton, California


