RE: Handling irregular sequences

From: Misha.Wolf@reuters.com
Date: Sun Oct 28 2001 - 13:29:01 EST

Previous message: Arjun Aggarwal: "Re: Numbers"
Maybe in reply to: David E. Hollingsworth: "Handling irregular sequences"

Two points in response to the questions:

1. The XML spec has just been amended by an erratum to clarify
that an irregular UTF-8 sequence must generate a fatal error.

2. It has been agreed that the Unicode Standard will be modified
to ban irregular UTF-8 sequences for all characters.

Misha

On 28/10/2001 12:52:37 Bernard Miller wrote:
> The question raised earlier by David Hollingsworth did
> not seem to get any responses from this list. I've
> pasted the text of the email below. I would also like
> clarification on why the UTF-8 in Unicode 3.1 only
> forbids conformant implementations from interpreting
> non-shortest forms for BMP characters, and does not
> forbid interpretation of all irregular sequences for
> all characters.
>
> ___
> Date: 5 Oct 2001 18:23:58 -0000
> From: "David E. Hollingsworth" <deh@fastanimals.com>
> To: unicode@unicode.org
> Subject: Handling irregular sequences
>
> The definition of UTF-32 (and the modifications to UTF-8 for Unicode
> 3.1) make it clear that conformant processes shall not generate
> irregular sequences. However, they do not (and perhaps they
> shouldn't) indicate what a process should do when encountering an
> irregular sequence, and I'm curious what people are doing in practice.
>
> One could apply the traditional Internet aphorism of being liberal in
> what one accepts, but that didn't pan out so well for
> non-shortest-form UTF-8, so in addition to wondering what people are
> doing in practice, I'm also curious about the following theoretical
> issue:
>
> It doesn't seem very likely to me that someone would write a security
> check that depends on, say, passing Deseret code points but blocking
> musical notation code points; however, I wouldn't say it's impossible;
> moreover, a security check that wants to disallow all non-BMP
> characters doesn't seem quite so outlandish. If someone did write
> such a check, it seems to me that the attack described in UAX #27
> would apply, by substituting "irregular sequence" for "non-shortest
> form":
>
>   Process A performs security checks, but does not check for irregular
>   sequences.
>
>   Process B accepts the byte sequence from process A, and transforms
>   it into UTF-16 while interpreting irregular sequences.
>
>   The UTF-16 text may then contain characters that should have been
>   filtered out by process A.
>
>
> Even if I'm mistaken about this, is there a specific argument *for*
> accepting irregular sequences?
>
> --deh!
>
> ___
>
> Bernard
>
>
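
The Process A / Process B scenario quoted above can be sketched in a few
lines of Python. This is only an illustration under stated assumptions:
the function names and the byte-level filter are hypothetical, and the
"lenient" converter is built from Python's "surrogatepass" error handler
rather than taken from any particular product. It uses U+10400, a Deseret
letter, encoded once in shortest form and once as an irregular six-byte
sequence (each UTF-16 surrogate encoded as its own three-byte unit).

```python
# U+10400 in regular (four-byte) UTF-8, and as an irregular six-byte
# sequence: surrogates D801 and DC00 each encoded as a 3-byte unit.
REGULAR = b"\xf0\x90\x90\x80"
IRREGULAR = b"\xed\xa0\x81\xed\xb0\x80"


def process_a_allows(data: bytes) -> bool:
    """Hypothetical filter: block non-BMP text by rejecting any
    four-byte lead byte. It does not look for irregular sequences."""
    return all(b < 0xF0 for b in data)


def process_b_lenient(data: bytes) -> str:
    """Hypothetical converter that interprets irregular sequences:
    decode surrogate code points, then pair them up via UTF-16."""
    units = data.decode("utf-8", "surrogatepass")
    utf16 = units.encode("utf-16-be", "surrogatepass")
    return utf16.decode("utf-16-be")


def process_b_strict(data: bytes) -> str:
    """Strict alternative: Python's default UTF-8 codec raises
    UnicodeDecodeError when it meets an encoded surrogate."""
    return data.decode("utf-8")


if __name__ == "__main__":
    print(process_a_allows(REGULAR))      # filter blocks the regular form
    print(process_a_allows(IRREGULAR))    # but lets the irregular form pass
    print(hex(ord(process_b_lenient(IRREGULAR))))  # non-BMP character appears
    try:
        process_b_strict(IRREGULAR)
    except UnicodeDecodeError as exc:
        print("strict decoder rejected input:", exc.reason)
```

In this sketch the irregular bytes pass the filter and still come out of
the lenient converter as a non-BMP character, while the strict decoder
stops with an error, which is the behaviour described in points 1 and 2
at the top of this message.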



This archive was generated by hypermail 2.1.2 : Sun Oct 28 2001 - 14:14:40 EST