Re: Handling of Surrogates

From: Mark Davis (mark.edward.davis@gmail.com)
Date: Thu Apr 16 2009 - 18:42:42 CDT

    > If you use the definition
    > '\uxxxx' - escape sequence for a UTF-16 code unit
    > '\Uxxxxxxxx' - escape sequence for a UTF-32 code unit

    You'll have to stop there, because that isn't the definition typically (and
    certainly not universally) used. In practice, these conventions arose as an
    adaptation as we went from UCS-2 to UTF-16:

    \u means the code point and the equivalent code unit (for these cases they
    have identical numeric values).

    \U means the code point and the equivalent code unit if <= FFFF (for these
    cases they have identical numeric values),
    and otherwise a code point and the equivalent two paired code units.
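
    The "two paired code units" for a supplementary code point follow the
    standard UTF-16 surrogate arithmetic. A minimal sketch (the function names
    are hypothetical, chosen only for illustration):

```python
def to_surrogates(cp):
    # Split a supplementary code point into (high, low) UTF-16 code units.
    if not 0x10000 <= cp <= 0x10FFFF:
        raise ValueError("not a supplementary code point")
    v = cp - 0x10000
    return 0xD800 + (v >> 10), 0xDC00 + (v & 0x3FF)


def from_surrogates(high, low):
    # Combine a high/low surrogate pair back into one code point.
    if not (0xD800 <= high <= 0xDBFF and 0xDC00 <= low <= 0xDFFF):
        raise ValueError("not a high/low surrogate pair")
    return 0x10000 + ((high - 0xD800) << 10) + (low - 0xDC00)


high, low = to_surrogates(0x1D41A)
print("%04X %04X" % (high, low))
```

    For U+1D41A this yields the pair D835 DC1A used in the examples in this
    thread.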

    While it would be possible to restrict the recognized escapes, best practice
    for interoperability is to accept all 7. When generating, as I said, it
    would be cleaner to not do A.3 or B.3, and to do B.2 if and only if B.4 is
    unavailable.
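
    The generating rule can be sketched as follows. This assumes the labels map
    onto the numbered lists quoted further down (A.3 = \U00000061,
    B.2 = \uD835\uDC1A, B.3 = \U0000D835\U0000DC1A, B.4 = \U0001D41A); the
    function names and the have_big_u flag are hypothetical, and the choice to
    leave printable ASCII unescaped is an assumption of this sketch only.

```python
def escape_char(ch, have_big_u=True):
    cp = ord(ch)
    # Assumption: printable ASCII other than the backslash stays literal.
    if 0x20 <= cp < 0x7F and ch != "\\":
        return ch
    # BMP code points use the short form; the 8-digit form (A.3) is never
    # generated for them.
    if cp <= 0xFFFF:
        return "\\u%04X" % cp
    # Supplementary code points: prefer the single 8-digit escape (B.4).
    if have_big_u:
        return "\\U%08X" % cp
    # Fall back to a surrogate pair of short escapes (B.2) only when the
    # 8-digit form is unavailable. B.3 is never generated.
    v = cp - 0x10000
    return "\\u%04X\\u%04X" % (0xD800 + (v >> 10), 0xDC00 + (v & 0x3FF))


def escape(text, have_big_u=True):
    return "".join(escape_char(c, have_big_u) for c in text)


print(escape("a\U0001D41A"))
print(escape("a\U0001D41A", have_big_u=False))
```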

    > Further, you create the problem that illegal UTF-32 can get converted to
    > legal UTF-32.

    These conventions are designed for and typically used for UTF-16 literal
    text. And if they are used with other UTFs, they should be interpreted as
    representing what they would be in UTF-16. That is, each of the 7 formats I
    listed would have the corresponding meaning.

    > You now have introduced into your distributed application a way to convert
    > illegal UTF-32 sequences silently to legal UTF-32 sequences. From a security
    > point of view, that would give me pause.

    First off, essentially nobody uses UTF-32 for interchange, so your example
    would have been better as UTF-8 (you can do the same example). Secondly,
    yes, this escaping format is based on UTF-16, and thus has some history
    behind it. But it doesn't present any significant problem. You can get a
    well-formed result from an ill-formed source if you do either of the
    following:

       - convert ill-formed UTF-32 or UTF-8 to the escaped form without
       checking for ill-formed source, *OR*
       - convert ill-formed UTF-32 or UTF-8 to UTF-16 without checking
       for ill-formed source.

    Of course, if I also do a conversion from UTF-32 where I replace surrogate
    code points by FFFD, I also get a valid result. The key problem for security
    is where I can sneak harmful characters past a gatekeeper. Very few servers
    use surrogate characters (or FFFD) as syntax characters ;-)
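
    Checking the source before conversion is a small step. The sketch below
    (check_utf32 and its replace flag are hypothetical names) either rejects
    ill-formed UTF-32 code units or replaces them with FFFD, which are the two
    behaviours discussed above. It assumes the input is a sequence of
    non-negative integers.

```python
def check_utf32(units, replace=False):
    # Return a well-formed list of code points, or raise on ill-formed input.
    out = []
    for u in units:
        # A UTF-32 code unit is ill-formed if it is a surrogate code point
        # or lies beyond U+10FFFF.
        if u > 0x10FFFF or 0xD800 <= u <= 0xDFFF:
            if not replace:
                raise ValueError("ill-formed UTF-32 code unit: %#x" % u)
            u = 0xFFFD
        out.append(u)
    return out


print([hex(u) for u in check_utf32([0xD835, 0xDC1A], replace=True)])
```

    Either mode yields a valid result; only the lenient mode hides the fact
    that the source was ill-formed.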

    And yes, I did forget B.3a and B.3b, which are also possible.

      a \U0000D835\uDC1A
      b \uD835\U0000DC1A

    Ugly, but the meaning is well-defined.
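
    A lenient reader that accepts all of these forms could look roughly like
    this. It is only a sketch of the UTF-16-based interpretation described
    above: decode_escapes is a hypothetical name, and raising an error on an
    unpaired surrogate is an assumption of the sketch, not something this
    thread settles.

```python
import re

# Matches a 4-digit short escape or an 8-digit long escape.
_ESCAPE = re.compile(r"\\u([0-9A-Fa-f]{4})|\\U([0-9A-Fa-f]{8})")


def decode_escapes(text):
    # Pass 1: turn literal characters and escapes into a list of numbers.
    units = []
    pos = 0
    for m in _ESCAPE.finditer(text):
        units.extend(ord(c) for c in text[pos:m.start()])
        units.append(int(m.group(1) or m.group(2), 16))
        pos = m.end()
    units.extend(ord(c) for c in text[pos:])

    # Pass 2: combine high/low surrogate pairs, however each half was written.
    out = []
    i = 0
    while i < len(units):
        u = units[i]
        if u > 0x10FFFF:
            raise ValueError("escape beyond U+10FFFF")
        if (0xD800 <= u <= 0xDBFF and i + 1 < len(units)
                and 0xDC00 <= units[i + 1] <= 0xDFFF):
            u = 0x10000 + ((u - 0xD800) << 10) + (units[i + 1] - 0xDC00)
            i += 1
        elif 0xD800 <= u <= 0xDFFF:
            # Assumption: an unpaired surrogate is treated as an error.
            raise ValueError("unpaired surrogate")
        out.append(chr(u))
        i += 1
    return "".join(out)


forms = [r"\uD835\uDC1A", r"\U0000D835\U0000DC1A", r"\U0001D41A",
         r"\U0000D835\uDC1A", r"\uD835\U0000DC1A"]
print(all(decode_escapes(f) == chr(0x1D41A) for f in forms))
```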

    Mark

    On Thu, Apr 16, 2009 at 15:42, Asmus Freytag <asmusf@ix.netcom.com> wrote:

    > On 4/16/2009 2:55 PM, Mark Davis wrote:
    >
    >> I disagree somewhat, if I understand what you wrote.
    >>
    > I think that you misunderstood what I wrote.
    >
    >> When the \u and \U conventions are used:
    >>
    >> U+0061 <http://unicode.org/cldr/utility/character.jsp?a=0061> ( a )
    >> LATIN SMALL LETTER A could be represented as any of:
    >>
    >> 1. 'a'
    >> 2. \u0061
    >> 3. \U00000061
    >>
    >> The use of #3 is a waste of space, but should not be illegal (except where
    >> \U is not available).
    >>
    > I agree completely so far.
    >
    >> U+1D41A <http://unicode.org/cldr/utility/character.jsp?a=1D41A> ( 𝐚 )
    >> MATHEMATICAL BOLD SMALL A could be represented as any of:
    >>
    >> 1. '𝐚'
    >> 2. \uD835\uDC1A
    >> 3. \U0000D835\U0000DC1A
    >> 4. \U0001D41A
    >>
    >> Similarly #3 is a waste of space, but should not be illegal. #2 and #3 are
    >> discouraged where \U is available or UTF-16 is not used, but #2 is necessary
    >> where \U is not available (eg Java). [Myself, I like \x{...} escaping
    >> better, since it is more uniform. Having a terminator allows variable
    >> length.]
    >>
    > OK. Here's where I think it matters how the escapes are defined.
    >
    > If you use the definition
    >
    > '\uxxxx' - escape sequence for a UTF-16 code unit
    > '\Uxxxxxxxx' - escape sequence for a UTF-32 code unit
    >
    > then everything is well-defined. Examples 1, 2, and 4 in your second set
    > are clearly legal, and example 3 is clearly not equivalent. Note that the
    > lack of equivalence follows from the definition of UTF-32, just as the
    > equivalence between examples 2 and 3 in the *first* set follows from the
    > definitions of UTF-32 and UTF-16.
    >
    > How would you rigorously define these two styles of escapes, so that
    > example #3 (second set) becomes legal? You would have to do something
    > complicated like
    >
    > '\uxxxx'     - escape sequence for a UTF-16 code unit
    > '\Uxxxxxxxx' - escape sequence for a UTF-32 code unit if
    >                xxxxxxxx >= 0x10000, but escape sequence for a
    >                UTF-16 code unit if xxxxxxxx < 0x10000.
    >
    > To me, that seems unnecessarily convoluted.
    >
    > Further, you create the problem that illegal UTF-32 can get converted to
    > legal UTF-32.
    >
    > Here's how: Client 1 starts out with illegal UTF-32 containing the sequence
    > <0000D835, 0000DC1A>. Assume this gets turned into the escapes
    > "\U0000D835\U0000DC1A" and sent to the server. Client 2 receives this
    > escaped sequence and interprets it as the single character sequence
    > <0001D41A>. Had client 1 sent the UTF-32 string to client 2 directly, client
    > 2 would have been able to reject it as illegal UTF-32.
    >
    > However, now we have client 3, which works in UTF-16 and has data of the
    > form <D835, DC1A>. Under your scheme, client 3 has a choice. It can send any
    > one of these four sequences of escape sequences containing surrogates
    > "\uD835\U0000DC1A"
    > "\U0000D835\uDC1A"
    > "\U0000D835\U0000DC1A"
    > or
    > "\uD835\uDC1A"
    >
    > To the server, the third sequence of escapes matches what client 1 has
    > produced starting with an illegal UTF-32 sequence.
    >
    > You now have introduced into your distributed application a way to convert
    > illegal UTF-32 sequences silently to legal UTF-32 sequences. From a security
    > point of view, that would give me pause.
    >
    > A./
    >
    >>
    >> Mark
    >>
    >>
    >>
    >> On Thu, Apr 16, 2009 at 13:04, Asmus Freytag <asmusf@ix.netcom.com> wrote:
    >>
    >> On 4/16/2009 12:04 PM, Sam Mason wrote:
    >>
    >> Hi All,
    >>
    >> I've got myself in a discussion about the correct handling of surrogate
    >> pairs. The background is as follows; the Postgres database server[1]
    >> currently assumes that the SQL it's receiving is in some user specified
    >> encoding, and it's been proposed that it would be nicer to be able to
    >> enter Unicode characters directly in the form of escape codes in a
    >> similar form to Python, i.e. support would be added for:
    >>
    >> '\uxxxx'
    >> and
    >> '\Uxxxxxxxx'
    >>
    >> The currently proposed patch[2] specifically handles surrogate pairs
    >> in the input. For example '\uD800\uDF02' and '\U00010302' would be
    >> considered to be valid and identical strings containing exactly one
    >> character. I was wondering if this should indeed be considered valid or
    >> if an error should be returned instead.
    >>
    >>
    >> As long as there are pairs of the surrogate code points provided
    >> as escape sequences, there's an unambiguous relation between each
    >> pair and a code point in the supplementary planes. So far, so good.
    >>
    >> The upside is that the dual escape sequences facilitate conversion
    >> to/from UTF-16. Each code unit in UTF-16 can be processed separately.
    >>
    >> The downside is that you now have two equivalent escape
    >> mechanisms, and you can no longer take a string with escape
    >> sequences and binarily compare it without bringing it into a
    >> canonical form.
    >>
    >> However, if one is allowed to represent the character "a" both as
    >> 'a' and as '\u0061' (which I assume is possible) then there's
    >> already a certain ambiguity built into the escape sequence mechanism.
    >>
    >> What should definitely result in an error is to write '\U0000D800'
    >> because the 8-digit form is to be understood as UTF-32, and in that
    >> context a lone surrogate value is not legal.
    >>
    >> So, in short, if the definition of the escapes is as follows
    >>
    >> '\uxxxx' - escape sequence for a UTF-16 code point
    >>
    >> '\Uxxxxxxxx' - escape sequence for a UTF-32 code point
    >>
    >> then everything is fine and predictable. If the definition of the
    >> shorter sequence is instead "a code point on the BMP", then it's
    >> not clear how to handle surrogate pairs.
    >>
    >> A./
    >>
    >>
    >>
    >


