Handling irregular sequences

From: David E. Hollingsworth (deh@fastanimals.com)
Date: Fri Oct 05 2001 - 14:23:58 EDT


The definition of UTF-32 (and the modifications to UTF-8 for Unicode
3.1) make it clear that conformant processes shall not generate
irregular sequences. However, they do not (and perhaps they
shouldn't) indicate what a process should do when encountering an
irregular sequence, and I'm curious what people are doing in practice.

One could apply the traditional Internet aphorism of being liberal in
what one accepts, but that didn't pan out so well for
non-shortest-form UTF-8, so in addition to wondering what people are
doing in practice, I'm also curious about the following theoretical
issue:

It doesn't seem very likely to me that someone would write a security
check that depends on, say, passing Deseret code points while blocking
musical notation code points, but I wouldn't say it's impossible.
Moreover, a security check that wants to disallow all non-BMP
characters doesn't seem quite so outlandish. If someone did write
such a check, it seems to me that the attack described in UAX #27
would apply, by substituting "irregular sequence" for "non-shortest
form":

  Process A performs security checks, but does not check for irregular
  sequences.

  Process B accepts the byte sequence from process A, and transforms
  it into UTF-16 while interpreting irregular sequences.

  The UTF-16 text may then contain characters that should have been
  filtered out by process A.
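A rough sketch of this in Python (the filter and the lenient decoder
here are hypothetical stand-ins for processes A and B, not any
particular implementation): a byte-level check for non-BMP characters
looks only for 4-byte UTF-8 lead bytes, so an irregular sequence --
a UTF-16 surrogate pair encoded as two 3-byte UTF-8 sequences --
sails through, yet a decoder that interprets irregular sequences
still yields the supplementary-plane character.

```python
def blocks_non_bmp(data: bytes) -> bool:
    """Process A: reject input containing a 4-byte UTF-8 lead byte
    (0xF0..0xF4), i.e. any *regularly* encoded non-BMP character."""
    return any(0xF0 <= b <= 0xF4 for b in data)

def lenient_decode(data: bytes) -> str:
    """Process B: decode UTF-8, but also interpret irregular
    sequences (surrogate halves encoded as 3-byte sequences)."""
    # 'surrogatepass' lets the lone surrogate code points through...
    s = data.decode("utf-8", "surrogatepass")
    # ...and a round trip through UTF-16 re-pairs them into one
    # supplementary-plane character.
    return s.encode("utf-16", "surrogatepass").decode("utf-16")

# U+10400 DESERET CAPITAL LETTER LONG I, encoded irregularly as the
# UTF-8 forms of its UTF-16 surrogates D801 DC00:
irregular = bytes([0xED, 0xA0, 0x81, 0xED, 0xB0, 0x80])

assert not blocks_non_bmp(irregular)               # slips past process A
assert lenient_decode(irregular) == "\U00010400"   # process B sees Deseret
```

So a filter that only understands regular UTF-8 never sees the
non-BMP character that a lenient downstream consumer reconstructs.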

Even if I'm mistaken about this, is there a specific argument *for*
accepting irregular sequences?

  --deh!



This archive was generated by hypermail 2.1.2 : Fri Oct 05 2001 - 12:41:18 EDT