Re: Handling of Surrogates

From: Asmus Freytag
Date: Thu Apr 16 2009 - 17:42:48 CDT


    On 4/16/2009 2:55 PM, Mark Davis wrote:
    > I disagree somewhat, if I understand what you wrote.
    I think that you misunderstood what I wrote.
    > When the \u and \U conventions are used:
    > U+0061 ( a ) LATIN SMALL LETTER A could be represented as any of:
    > 1. 'a'
    > 2. \u0061
    > 3. \U00000061
    > The use of #3 is a waste of space, but should not be illegal (except
    > where \U is not available).
    I agree completely so far.
    > U+1D41A ( 𝐚 ) MATHEMATICAL BOLD SMALL A could be represented as any of:
    > 1. '𝐚'
    > 2. \uD835\uDC1A
    > 3. \U0000D835\U0000DC1A
    > 4. \U0001D41A
    > Similarly #3 is a waste of space, but should not be illegal. #2 and #3
    > are discouraged where \U is available or UTF-16 is not used, but #2 is
    > necessary where \U is not available (e.g. Java). [Myself, I like \x{...}
    > escaping better, since it is more uniform. Having a terminator allows
    > variable length.]
    OK. Here's where I think it matters how the escapes are defined.

    If you use the definition

        '\uxxxx' - escape sequence for a UTF-16 code unit
        '\Uxxxxxxxx' - escape sequence for a UTF-32 code unit

    then everything is well-defined. Examples 1, 2, and 4 in your second set
    are clearly legal, and example 3 is clearly not equivalent. Note that the
    lack of equivalence follows from the definition of UTF-32, just as the
    equivalence between examples 2 and 3 in the *first* set follows from
    the definitions of UTF-32 and UTF-16.
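
    To make the two definitions concrete, here is a minimal Python sketch of
    a decoder that follows them literally. The name decode_strict is
    hypothetical and not taken from the Postgres patch; the sketch assumes
    \u always carries exactly four hex digits and \U exactly eight.

```python
import re

# One token per match: a \uXXXX escape, a \UXXXXXXXX escape, or one literal character.
TOKEN = re.compile(r"\\u([0-9A-Fa-f]{4})|\\U([0-9A-Fa-f]{8})|(.)", re.DOTALL)


def decode_strict(text):
    """Decode escapes where \\u is a UTF-16 code unit and \\U is a UTF-32 code unit."""
    out = []
    high = None  # pending high surrogate from a preceding \u escape
    for match in TOKEN.finditer(text):
        u16, u32, literal = match.groups()
        if u16 is not None:
            unit = int(u16, 16)
            if high is not None:
                if not 0xDC00 <= unit <= 0xDFFF:
                    raise ValueError("high surrogate not followed by low surrogate")
                # Combine the UTF-16 surrogate pair into one supplementary code point.
                out.append(chr(0x10000 + ((high - 0xD800) << 10) + (unit - 0xDC00)))
                high = None
            elif 0xD800 <= unit <= 0xDBFF:
                high = unit
            elif 0xDC00 <= unit <= 0xDFFF:
                raise ValueError("unpaired low surrogate")
            else:
                out.append(chr(unit))
            continue
        if high is not None:
            raise ValueError("unpaired high surrogate")
        if u32 is not None:
            value = int(u32, 16)
            # Surrogate values are never valid UTF-32 code units.
            if value > 0x10FFFF or 0xD800 <= value <= 0xDFFF:
                raise ValueError("not a valid UTF-32 code unit")
            out.append(chr(value))
        else:
            out.append(literal)
    if high is not None:
        raise ValueError("unpaired high surrogate")
    return "".join(out)
```

    Under this sketch, forms 1, 2 and 4 of the second set all decode to
    U+1D41A, while form 3 raises an error because D835 is not a valid
    UTF-32 code unit.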

    How would you rigorously define these two styles of escapes, so that
    example #3 (second set) becomes legal? You would have to do something
    complicated like

        '\uxxxx' - escape sequence for a UTF-16 code unit
        '\Uxxxxxxxx' - escape sequence for a UTF-32 code
                              unit if xxxxxxxx >= 0x10000, but escape
                              sequence for a UTF-16 code unit
                              if xxxxxxxx < 0x10000.

    To me, that seems unnecessarily convoluted.
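
    Spelled out as a Python sketch, the two-case rule looks like this
    (interpret_big_u is a hypothetical helper name, used only for
    illustration; it is not part of any proposed patch):

```python
def interpret_big_u(hex_digits):
    """Classify the eight hex digits of a \\U escape under the two-case rule.

    Returns a (kind, value) pair, where kind says which encoding form the
    value has to be read as.
    """
    if len(hex_digits) != 8:
        raise ValueError("\\U takes exactly eight hex digits")
    value = int(hex_digits, 16)
    if value > 0x10FFFF:
        raise ValueError("beyond U+10FFFF")
    if value >= 0x10000:
        # Supplementary range: the escape is a UTF-32 code unit.
        return ("utf32_code_unit", value)
    # Below 0x10000 the same escape is a UTF-16 code unit, which may be a
    # surrogate that only makes sense together with a neighbouring escape.
    return ("utf16_code_unit", value)
```

    The same escape syntax thus changes its meaning depending on the value
    it carries, so a decoder can no longer treat each \U escape on its own.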

    Further, you create the problem that illegal UTF-32 can get converted to
    legal UTF-32.

    Here's how: Client 1 starts out with illegal UTF-32 containing the
    sequence <0000D835, 0000DC1A>. Assume this gets turned into the escapes
    "\U0000D835\U0000DC1A" and sent to the server. Client 2 receives this
    escaped sequence and interprets it as the single character sequence
    <0001D41A>. Had client 1 sent the UTF-32 string to client 2 directly,
    client 2 would have been able to reject it as illegal UTF-32.

    However, now we have client 3, which works in UTF-16 and has data of the
    form <D835, DC1A>. Under your scheme, client 3 has a choice: it can send
    any one of the four representations in your second set, including the
    sequences of escapes that contain surrogates.

    To the server, the third of these is indistinguishable from what client 1
    produced starting with an illegal UTF-32 sequence.

    You have now introduced into your distributed application a way to
    convert illegal UTF-32 sequences silently to legal UTF-32 sequences.
    From a security point of view, that would give me pause.
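
    The scenario can be reproduced with a short Python sketch. All function
    names here are hypothetical, and decode_lenient assumes its input
    consists solely of \u and \U escapes; it stands in for a receiver that
    accepts surrogate values inside \U, as the lenient reading would require.

```python
import re

ESCAPE = re.compile(r"\\U([0-9A-Fa-f]{8})|\\u([0-9A-Fa-f]{4})")


def is_legal_utf32(units):
    """True if every code unit is a Unicode scalar value (no surrogates)."""
    return all(u <= 0x10FFFF and not 0xD800 <= u <= 0xDFFF for u in units)


def escape_utf32(units):
    """Client 1: blindly turn each UTF-32 code unit into a \\U escape."""
    return "".join("\\U%08X" % u for u in units)


def decode_lenient(escaped):
    """Receiver under the lenient reading: values below 0x10000 count as
    UTF-16 code units, so surrogates written with \\U get paired up."""
    units = []
    for match in ESCAPE.finditer(escaped):
        value = int(match.group(1) or match.group(2), 16)
        if value > 0x10FFFF:
            raise ValueError("beyond U+10FFFF")
        if value < 0x10000:
            units.append(value)
        else:
            # Split a supplementary code point into its surrogate pair.
            value -= 0x10000
            units.append(0xD800 + (value >> 10))
            units.append(0xDC00 + (value & 0x3FF))
    raw = b"".join(u.to_bytes(2, "big") for u in units)
    return raw.decode("utf-16-be")  # raises on unpaired surrogates


illegal = [0x0000D835, 0x0000DC1A]        # client 1's illegal UTF-32 data
wire = escape_utf32(illegal)              # what client 1 sends as escapes
received = decode_lenient(wire)           # what the receiver ends up with
laundered = [ord(ch) for ch in received]  # the same data as UTF-32 code units
```

    Here is_legal_utf32(illegal) is false, yet is_legal_utf32(laundered) is
    true: the receiver holds the single code point U+1D41A and cannot tell
    that it started out as an illegal sequence. Client 3's legitimate
    "\uD835\uDC1A" decodes to the same string.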

    > Mark
    > On Thu, Apr 16, 2009 at 13:04, Asmus Freytag wrote:
    > On 4/16/2009 12:04 PM, Sam Mason wrote:
    > > Hi All,
    > > I've got myself in a discussion about the correct handling of surrogate
    > > pairs. The background is as follows; the Postgres database server[1]
    > > currently assumes that the SQL it's receiving is in some user specified
    > > encoding, and it's been proposed that it would be nicer to be able to
    > > enter Unicode characters directly in the form of escape codes in a
    > > similar form to Python, i.e. support would be added for:
    > >     '\uxxxx'
    > > and
    > >     '\Uxxxxxxxx'
    > > The currently proposed patch[2] specifically handles surrogate pairs
    > > in the input. For example '\uD800\uDF02' and '\U00010302' would be
    > > considered to be valid and identical strings containing exactly one
    > > character. I was wondering if this should indeed be considered valid or
    > > if an error should be returned instead.
    > As long as there are pairs of the surrogate code points provided
    > as escape sequences, there's an unambiguous relation between each
    > pair and a code point in the supplementary planes. So far, so good.
    > The upside is that the dual escape sequences facilitate conversion
    > to/from UTF-16. Each code unit in UTF-16 can be processed separately.
    > The downside is that you now have two equivalent escape
    > mechanisms, and you can no longer take a string with escape
    > sequences and binarily compare it without bringing it into a
    > canonical form.
    > However, if one is allowed to represent the character "a" both as
    > 'a' and as '\u0061' (which I assume is possible) then there's
    > already a certain ambiguity built into the escape sequence mechanism.
    > What should definitely result in an error is to write '\U0000D800'
    > because the 8-digit form is to be understood as UTF-32, and in that
    > context a lone surrogate value is illegal.
    > So, in short, if the definition of the escapes is as follows
    >     '\uxxxx' - escape sequence for a UTF-16 code unit
    >     '\Uxxxxxxxx' - escape sequence for a UTF-32 code unit
    > then everything is fine and predictable. If the definition of the
    > shorter sequence is instead "a code point on the BMP", then it's
    > not clear how to handle surrogate pairs.
    > A./
