Re: Handling of Surrogates

[email protected]

--
Doug Ewell  *  Thornton, Colorado, USA  *  RFC 4645  *  UTN #14
http://www.ewellic.org
http://www1.ietf.org/html.charters/ltru-charter.html
http://www.alvestrand.no/mailman/listinfo/ietf-languages
  ----- Original Message ----- 
  From: Mark Davis
  To: Asmus Freytag
  Cc: Sam Mason ; [email protected]
  Sent: Thursday, April 16, 2009 17:42
  Subject: Re: Handling of Surrogates
  > If you use the definition
  >   '\uxxxx' - escape sequence for a UTF-16 code unit
  >   '\Uxxxxxxxx' - escape sequence for a UTF-32 code unit
  You'll have to stop there, because that isn't the definition typically 
(or certainly not universally) used. In practice, these conventions 
arose as an adaptation as we went from UCS-2 to UTF-16:
  \u means the code point and the equivalent code unit (for these cases 
they have identical numeric values).
  \U means the code point and the equivalent code unit if <= FFFF (for 
these cases they have identical numeric values),
  and otherwise a code point and the equivalent two paired code units.
  While it would be possible to restrict the recognized escapes, best 
practice for interoperability is to accept all 7. When generating, as I 
said, it would be cleaner to not do A.3 or B.3, and to do B.2 if and 
only if B.4 is unavailable.
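
A minimal sketch of the "accept everything" reading described above, in Python. The function name decode_lenient and the choice to return a list of integer code points are assumptions made for illustration; they do not come from any patch or library discussed in this thread. Surrogate values are paired no matter which escape form carried them, so the mixed forms are accepted as well.

```python
import re

# Matches \uXXXX (4 hex digits) or \UXXXXXXXX (8 hex digits).
_ESCAPE = re.compile(r"\\u([0-9A-Fa-f]{4})|\\U([0-9A-Fa-f]{8})")

def decode_lenient(text):
    """Return the code points of text, pairing surrogates from any escape form."""
    # Step 1: one integer per escape or literal character.
    values = []
    pos = 0
    for m in _ESCAPE.finditer(text):
        values.extend(ord(ch) for ch in text[pos:m.start()])
        values.append(int(m.group(1) or m.group(2), 16))
        pos = m.end()
    values.extend(ord(ch) for ch in text[pos:])

    # Step 2: combine each high/low surrogate pair into one code point.
    out = []
    i = 0
    while i < len(values):
        v = values[i]
        if (0xD800 <= v <= 0xDBFF and i + 1 < len(values)
                and 0xDC00 <= values[i + 1] <= 0xDFFF):
            v = 0x10000 + ((v - 0xD800) << 10) + (values[i + 1] - 0xDC00)
            i += 1
        elif v > 0x10FFFF:
            raise ValueError("value beyond U+10FFFF: %X" % v)
        # Unpaired surrogates are passed through unchanged in this sketch.
        out.append(v)
        i += 1
    return out
```

With this reading, every listed spelling of U+0061 decodes to [0x61] and every listed spelling of U+1D41A decodes to [0x1D41A].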
  > Further, you create the problem that illegal UTF-32 can get 
converted to legal UTF-32.
  These conventions are designed for and typically used for UTF-16 
literal text. And if they are used with other UTFs, they should be 
interpreted as representing what they would be in UTF-16. That is, each 
of the 7 formats I listed would have the corresponding meaning.
  > You now have introduced into your distributed application a way to 
convert illegal UTF-32 sequences silently to legal UTF-32 sequences. 
From a security point of view, that would give me pause.
  First off, essentially nobody uses UTF-32 for interchange, so your 
example would have been better as UTF-8 (you can do the same example). 
Secondly, yes, this escaping format is based on UTF-16, and thus has 
some history behind it. But it doesn't present any significant problem. 
You can get a well-formed result from an ill-formed source if you:
    a. convert ill-formed UTF-32 or UTF-8 to the escaped form 
without checking for ill-formed source, OR
    b. convert ill-formed UTF-32 or UTF-8 to UTF-16 without 
checking for ill-formed source.
  Of course, if I also do a conversion from UTF-32 where I replace 
surrogate code points by FFFD, I also get a valid result. The key 
problem for security is where I can sneak harmful characters past a 
gatekeeper. Very few servers use surrogate characters (or FFFD) as 
syntax characters ;-)
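
The two source-side options mentioned here (checking for ill-formed input, or replacing surrogate code points by FFFD) might look like the following Python sketch. UTF-32 is modelled as a list of integers, and both function names are hypothetical.

```python
def is_well_formed_utf32(units):
    """True if every unit is a Unicode scalar value (in range, not a surrogate)."""
    return all(0 <= u <= 0x10FFFF and not 0xD800 <= u <= 0xDFFF for u in units)

def replace_ill_formed(units):
    """Replace every out-of-range or surrogate unit with U+FFFD."""
    return [
        u if (0 <= u <= 0x10FFFF and not 0xD800 <= u <= 0xDFFF) else 0xFFFD
        for u in units
    ]
```

Either function, applied before escaping or converting, keeps the sequence <0000D835, 0000DC1A> from reaching the escaped form as two surrogate values.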
  And yes, I did forget B.3a and B.3b, which are also possible.
    a \U0000D835\uDC1A
    b \uD835\U0000DC1A
  Ugly, but the meaning is well-defined.
  Mark
  On Thu, Apr 16, 2009 at 15:42, Asmus Freytag <[email protected]> 
wrote:
    On 4/16/2009 2:55 PM, Mark Davis wrote:
      I disagree somewhat, if I understand what you wrote.
    I think that you misunderstood what I wrote.
      When the \u and \U conventions are used:
      |U+0061 <http://unicode.org/cldr/utility/character.jsp?a=0061>| 
( a ) LATIN SMALL LETTER A could be represented as any of:
        1. 'a'
        2. \u0061
        3. \U00000061
      The use of #3 is a waste of space, but should not be illegal 
(except where \U is not available).
    I agree completely so far.
      |U+1D41A <http://unicode.org/cldr/utility/character.jsp?a=1D41A>| 
( 𝐚 ) MATHEMATICAL BOLD SMALL A could be represented as any of:
        1. '𝐚'
        2. \uD835\uDC1A
        3. \U0000D835\U0000DC1A
        4. \U0001D41A
      Similarly #3 is a waste of space, but should not be illegal. #2 
and #3 are discouraged where \U is available or UTF-16 is not used, but 
#2 is necessary where \U is not available (e.g. Java). [Myself, I like 
\x{...} escaping better, since it is more uniform. Having a terminator 
allows variable length.]
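
For the generating side, a small Python sketch of producing form #4 when \U is available and falling back to form #2 (a surrogate pair of \u escapes) when it is not. The function name and the have_big_u flag are illustrative assumptions.

```python
def escape_code_point(cp, have_big_u=True):
    """Escape one code point as \\uXXXX, \\UXXXXXXXX, or a \\u surrogate pair."""
    if not 0 <= cp <= 0x10FFFF or 0xD800 <= cp <= 0xDFFF:
        raise ValueError("not a Unicode scalar value: %X" % cp)
    if cp <= 0xFFFF:
        return "\\u%04X" % cp
    if have_big_u:
        return "\\U%08X" % cp
    # No \U available: split into a UTF-16 surrogate pair.
    offset = cp - 0x10000
    high = 0xD800 + (offset >> 10)
    low = 0xDC00 + (offset & 0x3FF)
    return "\\u%04X\\u%04X" % (high, low)
```

For U+1D41A the offset is 0xD41A, which gives the high surrogate D835 and the low surrogate DC1A.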
    OK. Here's where I think it matters how the escapes are defined.
    If you use the definition
      '\uxxxx' - escape sequence for a UTF-16 code unit
      '\Uxxxxxxxx' - escape sequence for a UTF-32 code unit
    then everything is well-defined. Examples 1, 2, and 4 in your second 
set are clearly legal, and example 3 is clearly not equivalent. Note 
that the lack of equivalence follows from the definition of UTF-32, just as 
the equivalence between examples 2 and 3 in the *first* set follows 
from the definitions of UTF-32 and UTF-16.
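
A sketch of the stricter definition in Python, assuming \u carries a UTF-16 code unit and \U carries a UTF-32 code unit. Surrogates are then legal only as a \u high followed by a \u low, and any \U escape holding a surrogate value is rejected. The name decode_strict and the ValueError convention are assumptions for illustration.

```python
import re

_ESCAPE = re.compile(r"\\u([0-9A-Fa-f]{4})|\\U([0-9A-Fa-f]{8})")

def decode_strict(text):
    """Decode escapes; \\u is a UTF-16 code unit, \\U is a UTF-32 code unit."""
    tokens = []  # (kind, value) with kind in {"u", "U", "lit"}
    pos = 0
    for m in _ESCAPE.finditer(text):
        tokens.extend(("lit", ord(ch)) for ch in text[pos:m.start()])
        if m.group(1) is not None:
            tokens.append(("u", int(m.group(1), 16)))
        else:
            tokens.append(("U", int(m.group(2), 16)))
        pos = m.end()
    tokens.extend(("lit", ord(ch)) for ch in text[pos:])

    out = []
    i = 0
    while i < len(tokens):
        kind, v = tokens[i]
        if kind == "u" and 0xD800 <= v <= 0xDBFF:
            # A \u high surrogate must be followed by a \u low surrogate.
            nxt = tokens[i + 1] if i + 1 < len(tokens) else None
            if nxt is None or nxt[0] != "u" or not 0xDC00 <= nxt[1] <= 0xDFFF:
                raise ValueError("unpaired high surrogate %04X" % v)
            v = 0x10000 + ((v - 0xD800) << 10) + (nxt[1] - 0xDC00)
            i += 1
        elif 0xD800 <= v <= 0xDFFF or v > 0x10FFFF:
            # Covers \U surrogates, stray low surrogates, and out-of-range values.
            raise ValueError("ill-formed value %X" % v)
        out.append(v)
        i += 1
    return out
```

Under this sketch examples 1, 2, and 4 of the second set decode to U+1D41A, while example 3 raises an error.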
    How would you rigorously define these two styles of escapes, so that 
example #3 (second set) becomes legal? You would have to do something 
complicated like
      '\uxxxx' - escape sequence for a UTF-16 code unit
      '\Uxxxxxxxx' - escape sequence for a UTF-32 code
                            unit if xxxxxxxx >= 0x10000, but escape
                            sequence for a UTF-16 code unit
                            if xxxxxxxx < 0x10000.
    To me, that seems unnecessarily convoluted.
    Further, you create the problem that illegal UTF-32 can get 
converted to legal UTF-32.
    Here's how: Client 1 starts out with illegal UTF-32 containing the 
sequence <0000D835, 0000DC1A>. Assume this gets turned into the escapes 
"\U0000D835\U0000DC1A" and sent to the server. Client 2 receives this 
escaped sequence and interprets it as the single character sequence 
<0001D41A>. Had client 1 sent the UTF-32 string to client 2 directly, 
client 2 would have been able to reject it as illegal UTF-32.
    However, now we have client 3, which works in UTF-16 and has data of 
the form <D835, DC1A>. Under your scheme, client 3 has a choice. It can 
send any one of these four sequences of escape sequences containing 
surrogates
    "\uD835\U0000DC1A"
    "\U0000D835\uDC1A"
    "\U0000D835\U0000DC1A"
    or
    "\uD835\uDC1A"
    To the server, the third sequence of escapes matches what client 1 
has produced starting with an illegal UTF-32 sequence.
    You now have introduced into your distributed application a way to 
convert illegal UTF-32 sequences silently to legal UTF-32 sequences. 
From a security point of view, that would give me pause.
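
The scenario can be reproduced in a few lines of Python. Both helpers are hypothetical stand-ins: one for a client that escapes UTF-32 without validating it, the other for a receiver that pairs surrogate values found in \U escapes.

```python
import re

def escape_utf32_unchecked(units):
    """Client 1: escape each UTF-32 unit as \\UXXXXXXXX with no validation."""
    return "".join("\\U%08X" % u for u in units)

def decode_pairing(escaped):
    """Client 2: read \\U escapes and merge adjacent high/low surrogate values."""
    values = [int(h, 16) for h in re.findall(r"\\U([0-9A-Fa-f]{8})", escaped)]
    out = []
    i = 0
    while i < len(values):
        v = values[i]
        if (0xD800 <= v <= 0xDBFF and i + 1 < len(values)
                and 0xDC00 <= values[i + 1] <= 0xDFFF):
            v = 0x10000 + ((v - 0xD800) << 10) + (values[i + 1] - 0xDC00)
            i += 1
        out.append(v)
        i += 1
    return out

wire = escape_utf32_unchecked([0x0000D835, 0x0000DC1A])  # ill-formed UTF-32 in
result = decode_pairing(wire)                            # well-formed UTF-32 out
```

Here wire is the escaped string "\U0000D835\U0000DC1A" and result is the single code point 0x1D41A, so the ill-formed input is no longer detectable on the receiving side.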
    A./
      Mark
      On Thu, Apr 16, 2009 at 13:04, Asmus Freytag <[email protected]> wrote:
         On 4/16/2009 12:04 PM, Sam Mason wrote:
             Hi All,
             I've got myself in a discussion about the correct handling of
             surrogate pairs.  The background is as follows; the Postgres
             database server[1] currently assumes that the SQL it's receiving
             is in some user specified encoding, and it's been proposed that
             it would be nicer to be able to enter Unicode characters directly
             in the form of escape codes in a similar form to Python, i.e.
             support would be added for:
              '\uxxxx'
             and
              '\Uxxxxxxxx'
             The currently proposed patch[2] specifically handles surrogate
             pairs in the input.  For example '\uD800\uDF02' and '\U00010302'
             would be considered to be valid and identical strings containing
             exactly one character.  I was wondering if this should indeed be
             considered valid or if an error should be returned instead.
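
The equivalence of the two example strings follows from the UTF-16 surrogate arithmetic, sketched here in Python. The function name is an assumption and is not taken from the proposed patch.

```python
def combine_surrogates(high, low):
    """Map a UTF-16 high/low surrogate pair to its supplementary code point."""
    if not 0xD800 <= high <= 0xDBFF:
        raise ValueError("not a high surrogate: %04X" % high)
    if not 0xDC00 <= low <= 0xDFFF:
        raise ValueError("not a low surrogate: %04X" % low)
    return 0x10000 + ((high - 0xD800) << 10) + (low - 0xDC00)
```

For the pair D800, DF02 this gives 0x10000 + 0 + 0x302 = 0x10302, the code point written as '\U00010302'.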
         As long as there are pairs of the surrogate code points provided
         as escape sequences, there's an unambiguous relation between each
         pair and a code point in the supplementary planes. So far, so good.
         The upside is that the dual escape sequences facilitate conversion
         to/from UTF-16. Each code unit in UTF-16 can be processed separately.
         The downside is that you now have two equivalent escape
         mechanisms, and you can no longer take a string with escape
         sequences and binarily compare it without bringing it into a
         canonical form.
         However, if one is allowed to represent the character "a" both as
         'a' and as '\u0061' (which I assume is possible) then there's
         already a certain ambiguity built into the escape sequence mechanism.
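
One way to bring escaped strings into a canonical form before comparing them is sketched below in Python. It expands every escape to a character and then round-trips through UTF-16 so that surrogate pairs are merged; an unpaired surrogate makes the round trip fail with a UnicodeDecodeError. The function name canonical is hypothetical, and a real server would need its own rules about which escapes are permitted.

```python
import re

_ESCAPE = re.compile(r"\\u([0-9A-Fa-f]{4})|\\U([0-9A-Fa-f]{8})")

def canonical(text):
    """Expand \\u and \\U escapes and merge surrogate pairs into single characters."""
    expanded = _ESCAPE.sub(
        lambda m: chr(int(m.group(1) or m.group(2), 16)), text
    )
    # "surrogatepass" lets lone surrogates be written out as UTF-16 code units;
    # the strict decode then joins valid pairs and rejects unpaired ones.
    return expanded.encode("utf-16-le", "surrogatepass").decode("utf-16-le")
```

After canonicalisation, 'a' and '\u0061' compare equal, as do '\uD800\uDF02' and '\U00010302'.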
         What should definitely result in an error is to write '\U0000D800'
         because the 8-digit form is to be understood as UTF-32, and in that
         context there would be an issue.
         So, in short, if the definition of the escapes is as follows
           '\uxxxx' - escape sequence for a UTF-16 code unit
           '\Uxxxxxxxx' - escape sequence for a UTF-32 code unit
         then everything is fine and predictable. If the definition of the
         shorter sequence is instead "a code point on the BMP", then it's
         not clear how to handle surrogate pairs.
         A./