RE: Handling irregular sequences

From: Misha.Wolf@reuters.com
Date: Sun Oct 28 2001 - 13:29:01 EST


Two points in response to the questions:

1. The XML spec has just been amended by an erratum to clarify
    that an irregular UTF-8 sequence must generate a fatal error.

2. It has been agreed that the Unicode Standard will be modified
    to ban irregular UTF-8 sequences for all characters.
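
As a rough illustration of point 1, here is a minimal sketch of a decoder
that treats an irregular sequence as a fatal error. It assumes current
Python 3, whose strict UTF-8 codec already rejects surrogate code points
encoded as three-byte sequences. The helper name decode_or_fail and the
choice of U+10400 (a Deseret letter) as sample data are illustrative
assumptions, not taken from the XML erratum.

```python
# U+10400 encoded irregularly: each UTF-16 surrogate (D801, DC00)
# written as its own three-byte sequence.
IRREGULAR = bytes([0xED, 0xA0, 0x81, 0xED, 0xB0, 0x80])
# The same character in regular four-byte UTF-8.
REGULAR = bytes([0xF0, 0x90, 0x90, 0x80])


def decode_or_fail(data: bytes) -> str:
    """Decode UTF-8, raising ValueError on any ill-formed or irregular input.

    Hypothetical helper: relies on Python 3's strict UTF-8 codec, which
    refuses surrogate code points encoded as three-byte sequences.
    """
    try:
        return data.decode("utf-8", errors="strict")
    except UnicodeDecodeError as exc:
        raise ValueError(f"fatal: invalid UTF-8 at byte {exc.start}") from exc


if __name__ == "__main__":
    print(repr(decode_or_fail(REGULAR)))
    try:
        decode_or_fail(IRREGULAR)
    except ValueError as err:
        print(err)
```

The regular form decodes to a single character, while the irregular form
stops processing instead of being silently repaired.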

Misha

On 28/10/2001 12:52:37 Bernard Miller wrote:
> The question raised earlier by David Hollingsworth did
> not seem to get any responses from this list. I've
> pasted the text of the email below. I would also like
> clarification on why UTF-8 in Unicode 3.1 only
> forbids conformant implementations from interpreting
> non-shortest forms for BMP characters, and does not
> forbid interpretation of all irregular sequences for
> all characters.
>
> ___
> Date: 5 Oct 2001 18:23:58 -0000
> From: "David E. Hollingsworth" <deh@fastanimals.com>
> To: unicode@unicode.org
> Subject: Handling irregular sequences
>
> The definition of UTF-32 (and the modifications to UTF-8 for Unicode
> 3.1) make it clear that conformant processes shall not generate
> irregular sequences. However, they do not (and perhaps they
> shouldn't) indicate what a process should do when encountering an
> irregular sequence, and I'm curious what people are doing in practice.
>
> One could apply the traditional Internet aphorism of being liberal in
> what one accepts, but that didn't pan out so well for
> non-shortest-form UTF-8, so in addition to wondering what people are
> doing in practice, I'm also curious about the following theoretical
> issue:
>
> It doesn't seem very likely to me that someone would write a security
> check that depends on, say, passing Deseret code points but blocking
> musical notation code points; however, I wouldn't say it's impossible;
> moreover, a security check that wants to disallow all non-BMP
> characters doesn't seem quite so outlandish. If someone did write
> such a check, it seems to me that the attack described in UAX #27
> would apply, by substituting "irregular sequence" for "non-shortest
> form":
>
> Process A performs security checks, but does not check for irregular
> sequences.
>
> Process B accepts the byte sequence from process A, and transforms
> it into UTF-16 while interpreting irregular sequences.
>
> The UTF-16 text may then contain characters that should have been
> filtered out by process A.
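
The three steps above can be sketched as follows. This is only an
illustration under stated assumptions: the function names
process_a_allows and process_b_to_utf16 are hypothetical, the filter is a
deliberately naive one that blocks non-BMP characters by looking for
four-byte lead bytes, and the lenient conversion is imitated with
Python 3's "surrogatepass" error handler.

```python
# U+10400 (a Deseret letter, outside the BMP) as an irregular sequence:
# surrogates D801 and DC00, each encoded as three bytes.
IRREGULAR = bytes([0xED, 0xA0, 0x81, 0xED, 0xB0, 0x80])


def process_a_allows(data: bytes) -> bool:
    """Naive security check: reject input containing a 4-byte lead byte.

    Hypothetical filter meant to block all non-BMP characters. It does
    not check for irregular sequences.
    """
    return not any(b >= 0xF0 for b in data)


def process_b_to_utf16(data: bytes) -> bytes:
    """Lenient conversion to UTF-16 (big-endian) that interprets
    irregular sequences instead of rejecting them."""
    text = data.decode("utf-8", errors="surrogatepass")
    return text.encode("utf-16-be", errors="surrogatepass")


if __name__ == "__main__":
    assert process_a_allows(IRREGULAR)       # the check is passed
    utf16 = process_b_to_utf16(IRREGULAR)    # D8 01 DC 00: a valid pair
    smuggled = utf16.decode("utf-16-be")
    print(hex(ord(smuggled)))                # a code point above U+FFFF
```

The bytes pass process A because none of them is a four-byte lead byte,
yet the UTF-16 output of process B is a well-formed surrogate pair for a
non-BMP character.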
>
>
> Even if I'm mistaken about this, is there a specific argument *for*
> accepting irregular sequences?
>
> --deh!
>
> ___
>
> Bernard


This archive was generated by hypermail 2.1.2 : Sun Oct 28 2001 - 14:14:40 EST