Re: Handling of Surrogates

From: Doug Ewell (doug@ewellic.org)
Date: Thu Apr 16 2009 - 22:56:36 CDT


    I have to agree with Asmus on this. Even if the \Uxxxxxxxx notation was
    originally created to get around the four-hex-digit limit of \uxxxx, it
    does imply a 32-bit value. Writing \U0000D835\U0000DC1A would strongly
    imply that two characters are being represented, not one. With this
    extended notation, there should be no reason to fall back to UTF-16 code
    units.
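To illustrate the point in Python 3 terms: this is only a sketch of how one language that reads each \U escape as a single code point behaves. The helper name `is_well_formed` is made up for the example.

```python
# Python 3 string literals treat each \U escape as exactly one code point.
paired = "\U0000D835\U0000DC1A"   # two escapes, each holding a surrogate value
single = "\U0001D41A"             # one escape, one supplementary code point

print(len(paired))        # 2
print(len(single))        # 1
print(paired == single)   # False


def is_well_formed(text):
    """Return True if text has no lone surrogates (it encodes to UTF-8)."""
    try:
        text.encode("utf-8")
        return True
    except UnicodeEncodeError:
        return False
```

Under this reading the paired form is two ill-formed characters, not an alternative spelling of U+1D41A.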

    --
    Doug Ewell  *  Thornton, Colorado, USA  *  RFC 4645  *  UTN #14
    http://www.ewellic.org
    http://www1.ietf.org/html.charters/ltru-charter.html
    http://www.alvestrand.no/mailman/listinfo/ietf-languages
      ----- Original Message ----- 
      From: Mark Davis
      To: Asmus Freytag
      Cc: Sam Mason ; unicode@unicode.org
      Sent: Thursday, April 16, 2009 17:42
      Subject: Re: Handling of Surrogates
      > If you use the definition
      >   '\uxxxx' - escape sequence for a UTF-16 code unit
      >   '\Uxxxxxxxx' - escape sequence for a UTF-32 code unit
      You'll have to stop there, because that isn't the definition typically 
    (or certainly not universally) used. In practice, these conventions 
    arose as an adaptation as we went from UCS-2 to UTF-16:
      \u means the code point and the equivalent code unit (for these cases 
    they have identical numeric values).
      \U means the code point and the equivalent code unit if <= FFFF (for 
    these cases they have identical numeric values),
      and otherwise a code point and the equivalent two paired code units.
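The convention described above could be sketched in Python 3 roughly as follows. The function name `decode_utf16_style` is hypothetical, and the sketch assumes that backslashes in the input only ever introduce \u or \U escapes.

```python
import re

# \u followed by 4 hex digits, or \U followed by 8 hex digits.
_ESCAPE = re.compile(r"\\u([0-9A-Fa-f]{4})|\\U([0-9A-Fa-f]{8})")


def decode_utf16_style(text):
    """Expand \\u and \\U escapes, pairing surrogates as UTF-16 would."""
    def expand(match):
        value = int(match.group(1) or match.group(2), 16)
        if value > 0x10FFFF:
            raise ValueError("escape out of Unicode range: " + match.group(0))
        return chr(value)

    expanded = _ESCAPE.sub(expand, text)
    # Round-trip through UTF-16: 'surrogatepass' writes surrogate values out
    # as plain code units, and the strict decode then joins adjacent
    # high/low pairs and raises UnicodeDecodeError for an unpaired one.
    return expanded.encode("utf-16-le", "surrogatepass").decode("utf-16-le")
```

With this reading, every escaped spelling of U+1D41A discussed in this thread, including the mixed \u and \U forms, decodes to the same single character.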
      While it would be possible to restrict the recognized escapes, best 
    practice for interoperability is to accept all 7. When generating, as I 
    said, it would be cleaner to not do A.3 or B.3, and to do B.2 if and 
    only if B.4 is unavailable.
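A generator following that advice might look like the Python 3 sketch below. The name `escape_non_ascii` and the `have_big_u` flag are invented for the illustration, and passing ASCII through unescaped (including the backslash itself) is an assumption of the sketch.

```python
def escape_non_ascii(text, have_big_u=True):
    """Escape non-ASCII characters, preferring \\U for supplementary ones."""
    out = []
    for ch in text:
        cp = ord(ch)
        if cp < 0x80:
            out.append(ch)                      # literal form for ASCII
        elif 0xD800 <= cp <= 0xDFFF:
            raise ValueError("lone surrogate in source: U+%04X" % cp)
        elif cp <= 0xFFFF:
            out.append("\\u%04X" % cp)          # BMP: short form only
        elif have_big_u:
            out.append("\\U%08X" % cp)          # supplementary: B.4
        else:
            # B.2, used only when \U is unavailable: split into a
            # UTF-16 surrogate pair and emit two \u escapes.
            offset = cp - 0x10000
            high = 0xD800 + (offset >> 10)
            low = 0xDC00 + (offset & 0x3FF)
            out.append("\\u%04X\\u%04X" % (high, low))
    return "".join(out)
```

The sketch never emits the eight-digit form for BMP characters or for surrogate values, which corresponds to avoiding A.3 and B.3.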
      > Further, you create the problem that illegal UTF-32 can get 
    converted to legal UTF-32.
      These conventions are designed for and typically used for UTF-16 
    literal text. And if they are used with other UTFs, they should be 
    interpreted as representing what they would be in UTF-16. That is, each 
    of the 7 formats I listed would have the corresponding meaning.
      > You now have introduced into your distributed application a way to 
    convert illegal UTF-32 sequences silently to legal UTF-32 sequences. 
    From a security point of view, that would give me pause.
      First off, essentially nobody uses UTF-32 for interchange, so your 
    example would have been better as UTF-8 (you can do the same example). 
    Secondly, yes, this escaping format is based on UTF-16, and thus has 
    some history behind it. But it doesn't present any significant problem. 
    You can get a well-formed result from an ill-formed source if:
        a. you convert ill-formed UTF-32 or UTF-8 to the escaped form 
    without checking for ill-formed source, OR
        b. you convert ill-formed UTF-32 or UTF-8 to UTF-16 without 
    checking for ill-formed source.
      Of course, if I also do a conversion from UTF-32 where I replace 
    surrogate code points by FFFD, I also get a valid result. The key 
    problem for security is where I can sneak harmful characters past a 
    gatekeeper. Very few servers use surrogate characters (or FFFD) as 
    syntax characters ;-)
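The two ways of handling ill-formed UTF-32 mentioned above, rejecting it before conversion or replacing surrogate code points with FFFD, could be sketched in Python 3 as follows. Both function names are hypothetical, and the input is assumed to be a sequence of integers, one per UTF-32 code unit.

```python
def _is_scalar_value(cp):
    """True for a Unicode scalar value: in range and not a surrogate."""
    return 0 <= cp <= 0x10FFFF and not (0xD800 <= cp <= 0xDFFF)


def check_utf32(code_units):
    """Return the code units as a list, or raise if any is ill-formed."""
    units = list(code_units)
    for index, cp in enumerate(units):
        if not _is_scalar_value(cp):
            raise ValueError(
                "ill-formed UTF-32 at index %d: %#x" % (index, cp))
    return units


def replace_ill_formed(code_units):
    """Return a copy with every ill-formed code unit replaced by U+FFFD."""
    return [cp if _is_scalar_value(cp) else 0xFFFD for cp in code_units]
```

Either function, run before escaping or transcoding, closes the gap described in cases a and b.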
      And yes, I did forget B.3a and B.3b, which are also possible.
        a \U0000D835\uDC1A
        b \uD835\U0000DC1A
      Ugly, but the meaning is well-defined.
      Mark
      On Thu, Apr 16, 2009 at 15:42, Asmus Freytag <asmusf@ix.netcom.com> wrote:
        On 4/16/2009 2:55 PM, Mark Davis wrote:
          I disagree somewhat, if I understand what you wrote.
        I think that you misunderstood what I wrote.
          When the \u and \U conventions are used:
          |U+0061 <http://unicode.org/cldr/utility/character.jsp?a=0061>| 
    ( a ) LATIN SMALL LETTER A could be represented as any of:
            1. 'a'
            2. \u0061
            3. \U00000061
          The use of #3 is a waste of space, but should not be illegal 
    (except where \U is not available).
        I agree completely so far.
          |U+1D41A <http://unicode.org/cldr/utility/character.jsp?a=1D41A>| 
    ( 𝐚 ) MATHEMATICAL BOLD SMALL A could be represented as any of:
            1. '𝐚'
            2. \uD835\uDC1A
            3. \U0000D835\U0000DC1A
            4. \U0001D41A
          Similarly #3 is a waste of space, but should not be illegal. #2 
    and #3 are discouraged where \U is available or UTF-16 is not used, but 
    #2 is necessary where \U is not available (eg Java). [Myself, I like 
    \x{...} escaping better, since it is more uniform. Having a terminator 
    allows variable length.]
        OK. Here's where I think it matters how the escapes are defined.
        If you use the definition
          '\uxxxx' - escape sequence for a UTF-16 code unit
          '\Uxxxxxxxx' - escape sequence for a UTF-32 code unit
        then everything is well-defined. Examples 1, 2, and 4 in your second 
    set are clearly legal, and example 3 is clearly not equivalent. Note 
    that the lack of equivalence follows from the definition of UTF-32, just as 
    the equivalence between examples 2 and 3 in the *first* set follows 
    from the definitions of UTF-32 and UTF-16.
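For comparison, a decoder that follows this stricter pair of definitions could be sketched in Python 3 like this. The name `decode_strict` is invented for the example, and backslashes are assumed to appear only in \u and \U escapes.

```python
import re

# \u followed by 4 hex digits, or \U followed by 8 hex digits.
_ESCAPE = re.compile(r"\\u([0-9A-Fa-f]{4})|\\U([0-9A-Fa-f]{8})")


def decode_strict(text):
    """Treat \\u as a UTF-16 code unit and \\U as a UTF-32 code unit."""
    def expand(match):
        if match.group(1) is not None:
            # \u: any 16-bit code unit, so surrogate values are allowed here.
            return chr(int(match.group(1), 16))
        value = int(match.group(2), 16)
        # \U: must be valid UTF-32, so surrogates and values above
        # 10FFFF are rejected.
        if value > 0x10FFFF or 0xD800 <= value <= 0xDFFF:
            raise ValueError("not a valid UTF-32 code unit: " + match.group(0))
        return chr(value)

    expanded = _ESCAPE.sub(expand, text)
    # Pair up \u surrogates via a UTF-16 round trip; an unpaired one
    # makes the strict decode raise UnicodeDecodeError.
    return expanded.encode("utf-16-le", "surrogatepass").decode("utf-16-le")
```

Here examples 2 and 4 of the second set decode to U+1D41A while example 3 is an error, and examples 2 and 3 of the first set both decode to "a".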
        How would you rigorously define these two styles of escapes, so that 
    example #3 (second set) becomes legal? You would have to do something 
    complicated like
          '\uxxxx' - escape sequence for a UTF-16 code unit
          '\Uxxxxxxxx' - escape sequence for a UTF-32 code
                                unit if xxxxxxxx >= 0x10000, but escape
                                sequence for a UTF-16 code unit
                                if xxxxxxxx < 0x10000.
        To me, that seems unnecessarily convoluted.
        Further, you create the problem that illegal UTF-32 can get 
    converted to legal UTF-32.
        Here's how: Client 1 starts out with illegal UTF-32 containing the 
    sequence <0000D835, 0000DC1A>. Assume this gets turned into the escapes 
    "\U0000D835\U0000DC1A" and sent to the server. Client 2 receives this 
    escaped sequence and interprets it as the single character sequence 
    <0001D41A>. Had client 1 sent the UTF-32 string to client 2 directly, 
    client 2 would have been able to reject it as illegal UTF-32.
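The two paths in this scenario can be traced in a short Python 3 sketch. The names `lenient_decode`, `direct_rejected` and `received` are made up for the illustration; the lenient decoder stands in for client 2 and is an assumption about how such a client might behave.

```python
import re

# Client 1's ill-formed UTF-32: two surrogate values as 32-bit code units.
ill_formed = [0x0000D835, 0x0000DC1A]

# Direct path: send the raw UTF-32 bytes. A strict decoder rejects them.
raw = b"".join(cp.to_bytes(4, "big") for cp in ill_formed)
try:
    raw.decode("utf-32-be")
    direct_rejected = False
except UnicodeDecodeError:
    direct_rejected = True

# Escaped path: client 1 escapes each code unit without validating it.
escaped = "".join("\\U%08X" % cp for cp in ill_formed)


def lenient_decode(text):
    """Expand \\U escapes, then pair surrogates as if they were UTF-16."""
    expanded = re.sub(r"\\U([0-9A-Fa-f]{8})",
                      lambda m: chr(int(m.group(1), 16)), text)
    return expanded.encode("utf-16-le", "surrogatepass").decode("utf-16-le")


# Client 2 ends up with one well-formed supplementary character.
received = lenient_decode(escaped)
```

The same input is rejected on the direct path and accepted as U+1D41A on the escaped path.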
        However, now we have client 3, which works in UTF-16 and has data of 
    the form <D835, DC1A>. Under your scheme, client 3 has a choice. It can 
    send any one of these four sequences of escape sequences containing 
    surrogates
        "\uD835\U0000DC1A"
        "\U0000D835\uDC1A"
        "\U0000D835\U0000DC1A"
        or
        "\uD835\uDC1A"
        To the server, the third sequence of escapes matches what client 1 
    has produced starting with an illegal UTF-32 sequence.
        You now have introduced into your distributed application a way to 
    convert illegal UTF-32 sequences silently to legal UTF-32 sequences. 
    From a security point of view, that would give me pause.
        A./
          Mark
          On Thu, Apr 16, 2009 at 13:04, Asmus Freytag <asmusf@ix.netcom.com> wrote:
             On 4/16/2009 12:04 PM, Sam Mason wrote:
                 Hi All,
                 I've got myself in a discussion about the correct handling
                 of surrogate pairs.  The background is as follows; the
                 Postgres database server[1] currently assumes that the SQL
                 it's receiving is in some user specified encoding, and it's
                 been proposed that it would be nicer to be able to enter
                 Unicode characters directly in the form of escape codes in
                 a similar form to Python, i.e. support would be added for:
                  '\uxxxx'
                 and
                  '\Uxxxxxxxx'
                 The currently proposed patch[2] specifically handles
                 surrogate pairs in the input.  For example '\uD800\uDF02'
                 and '\U00010302' would be considered to be valid and
                 identical strings containing exactly one character.  I was
                 wondering if this should indeed be considered valid or if
                 an error should be returned instead.
             As long as there are pairs of the surrogate code points provided
             as escape sequences, there's an unambiguous relation between each
             pair and a code point in the supplementary planes. So far, so good.
             The upside is that the dual escape sequences facilitate conversion
             to/from UTF-16. Each code unit in UTF-16 can be processed
             separately.
             The downside is that you now have two equivalent escape
             mechanisms, and you can no longer take a string with escape
             sequences and compare it byte for byte without first bringing
             it into a canonical form.
             However, if one is allowed to represent the character "a" both as
             'a' and as '\u0061' (which I assume is possible) then there's
             already a certain ambiguity built into the escape sequence
             mechanism.
             What should definitely result in an error is to write
             '\U0000D800', because the 8-digit form is to be understood as
             UTF-32, where a surrogate value is ill-formed.
             So, in short, if the definition of the escapes is as follows
               '\uxxxx' - escape sequence for a UTF-16 code unit
               '\Uxxxxxxxx' - escape sequence for a UTF-32 code unit
             then everything is fine and predictable. If the definition of the
             shorter sequence is instead "a code point on the BMP", then it's
             not clear how to handle surrogate pairs.
             A./
    

