Re: utf-8 != latin-1

From: Mark Davis
Date: Tue Oct 17 2000 - 09:24:26 EDT

One of the main features of XML is that it has quite strict rules about how
to handle errors. The goal, I believe, is to ensure that we are not awash in
malformed files that have no clear interpretation.

And this is clearly an error: the acceptable code points are quite clearly

Converting an illegal UTF-8 sequence into a valid -- BUT WRONG -- sequence
of valid code points is clearly against the intent of this production rule.
XML could have taken the opposite tack -- that illegal code points and
illegal code unit sequences are to be ignored. But it didn't.
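
A minimal sketch of that strictness in Python. It assumes the standard library's strict UTF-8 decoder as a stand-in for an XML processor's input check, and the helper name first_illegal_offset is made up for illustration. The point is that an illegal sequence is reported, not quietly mapped to some other code point:

```python
from typing import Optional


def first_illegal_offset(data: bytes) -> Optional[int]:
    """Return the byte offset of the first ill-formed UTF-8 sequence, or None.

    Hypothetical helper: it relies on Python's strict decoder, which raises
    UnicodeDecodeError rather than substituting or skipping bad bytes.
    """
    try:
        data.decode("utf-8", errors="strict")
    except UnicodeDecodeError as exc:
        return exc.start  # offset where the illegal sequence begins
    return None


# The bytes from the quoted message: AD can never start a UTF-8 sequence.
bad = bytes([0xAD, 0x63, 0x61, 0x73])
good = "caf\u00e9".encode("utf-8")

print(first_illegal_offset(bad))
print(first_illegal_offset(good))
```

A caller in the XML spirit would treat any non-None result as a fatal error and stop, rather than carry on with a guessed interpretation.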


BTW, I have a simple browser-based UTF converter (in Javascript) at (click on Converter). It lets
you convert back and forth to different UTFs, with various choices for
format. And it checks for illegal UTF-8 sequences!

----- Original Message -----
From: "Doug Ewell"
To: "Unicode List"
Sent: Friday, October 13, 2000 21:59
Subject: Re: utf-8 != latin-1

"Steven R. Loomis" wrote:

> What happened was that the sequence AD 63 61 73 was
> interpreted as U+E54E U+DC73..

Why? As an illegal UTF-8 sequence, it shouldn't be interpreted as

John Cowan's "utf" perl script (which carries the appropriate
disclaimers about no error checking) converts that sequence to U+D94E
U+DC73, which seems a bit more reasonable -- at least it's a complete
surrogate pair.
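
For what it's worth, both results can be reproduced arithmetically. The Python sketch below is a guess at the mechanism, not a description of either actual tool: it assumes a decoder that treats AD as the lead byte of a four-byte sequence, keeps its low four bits, never checks that the following bytes are continuation bytes, and then writes UTF-16. The function names and the lead_mask and mask_high parameters are invented for illustration.

```python
def lenient_four_byte(b0, b1, b2, b3, lead_mask=0x0F):
    """Blindly combine four bytes as if they were one UTF-8 sequence.

    Assumption: no check that b1..b3 are continuation bytes (10xxxxxx).
    """
    return (((b0 & lead_mask) << 18) | ((b1 & 0x3F) << 12)
            | ((b2 & 0x3F) << 6) | (b3 & 0x3F))


def to_utf16_units(cp, mask_high):
    """Split a value into UTF-16 'surrogates' without any range check."""
    v = cp - 0x10000
    high = v >> 10
    if mask_high:
        high &= 0x3FF  # keep only the 10 bits a real high surrogate can hold
    return 0xD800 + high, 0xDC00 + (v & 0x3FF)


cp = lenient_four_byte(0xAD, 0x63, 0x61, 0x73)  # 0x363873, beyond U+10FFFF

# Without masking, the high unit overflows out of the surrogate range.
print([f"U+{u:04X}" for u in to_utf16_units(cp, mask_high=False)])
# With masking, the result is at least a well-formed surrogate pair.
print([f"U+{u:04X}" for u in to_utf16_units(cp, mask_high=True)])
```

Under those assumptions the unmasked variant yields U+E54E U+DC73 and the masked variant yields U+D94E U+DC73, matching the two values reported above.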

-Doug Ewell
 Fullerton, California
