RE: Handling of Surrogates

From: Peter Constable
Date: Fri Apr 17 2009 - 09:34:49 CDT

    For U+1D41A, wouldn’t it matter how #3 gets interpreted? If that gets mapped into a UTF-32BE byte sequence [0x00,0x00,0xd8,0x35,0x00,0x00,0xdc,0x1a], it seems to me that would not be good.

    If #3 gets mapped into a UTF-32BE byte sequence of [0x00,0x01,0xd4,0x1a], or if it gets mapped into a UTF-16BE byte sequence of [0xd8,0x35,0xdc,0x1a], or the LE or UTF-8 equivalents, then that would be OK.
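The byte sequences above can be checked with a short Python sketch. It uses only the standard codecs; the variable names are illustrative, and the last step assumes Python 3.4 or later, where the UTF-32 decoder rejects surrogate code points.

```python
ch = "\U0001D41A"  # MATHEMATICAL BOLD SMALL A

utf32be = ch.encode("utf-32-be")
utf16be = ch.encode("utf-16-be")
utf8 = ch.encode("utf-8")

print(utf32be.hex(" "))  # 00 01 d4 1a
print(utf16be.hex(" "))  # d8 35 dc 1a
print(utf8.hex(" "))     # f0 9d 90 9a

# The problematic reading of #3: each surrogate stored as its own UTF-32 unit.
bad = bytes([0x00, 0x00, 0xD8, 0x35, 0x00, 0x00, 0xDC, 0x1A])
try:
    bad.decode("utf-32-be")
    bad_is_valid = True
except UnicodeDecodeError:
    # Surrogate code points are not allowed in UTF-32.
    bad_is_valid = False

print(bad_is_valid)
```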


    From: Mark Davis
    Sent: Thursday, April 16, 2009 2:56 PM
    To: Asmus Freytag
    Cc: Sam Mason
    Subject: Re: Handling of Surrogates

    I disagree somewhat, if I understand what you wrote. When the \u and \U conventions are used:

    U+0061 ( a ) LATIN SMALL LETTER A could be represented as any of:

     1. 'a'
     2. \u0061
     3. \U00000061
    The use of #3 is a waste of space, but should not be illegal (except where \U is not available). E.g. \U00000061

    U+1D41A ( 𝐚 ) MATHEMATICAL BOLD SMALL A could be represented as any of:

     1. '𝐚'
     2. \uD835\uDC1A
     3. \U0000D835\U0000DC1A
     4. \U0001D41A
    Similarly, #3 is a waste of space, but should not be illegal. #2 and #3 are discouraged where \U is available or UTF-16 is not used, but #2 is necessary where \U is not available (e.g. Java). [Myself, I like \x{...} escaping better, since it is more uniform. Having a terminator allows variable length.]
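A minimal sketch of the lenient reading described above, in which all four spellings of U+1D41A yield the same one-character string. The helper name `unescape` and the regular expression are illustrative assumptions, not part of any proposed patch; the sketch expands each escape to a code point and then round-trips through UTF-16 so that a surrogate pair collapses into a single supplementary character.

```python
import re

# Matches \uXXXX (4 hex digits) or \UXXXXXXXX (8 hex digits).
_ESCAPE = re.compile(r"\\u([0-9A-Fa-f]{4})|\\U([0-9A-Fa-f]{8})")

def unescape(text):
    # Hypothetical helper: expand both escape forms leniently.
    def repl(match):
        return chr(int(match.group(1) or match.group(2), 16))
    expanded = _ESCAPE.sub(repl, text)
    # Round-trip through UTF-16 so that a high/low surrogate pair,
    # however it was written, becomes one supplementary code point.
    # An unpaired surrogate makes the strict decode raise UnicodeDecodeError.
    return expanded.encode("utf-16-le", "surrogatepass").decode("utf-16-le")

forms = [
    "\U0001D41A",             # 1. the literal character
    r"\uD835\uDC1A",          # 2. two \u escapes
    r"\U0000D835\U0000DC1A",  # 3. two \U escapes holding surrogates
    r"\U0001D41A",            # 4. one \U escape
]
results = [unescape(form) for form in forms]
print(all(r == "\U0001D41A" for r in results))  # True
```

Under this policy form #3 is accepted rather than rejected; a stricter policy would check the value of each \U escape before expanding it.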


    On Thu, Apr 16, 2009 at 13:04, Asmus Freytag wrote:
    On 4/16/2009 12:04 PM, Sam Mason wrote:
    Hi All,

    I've got myself in a discussion about the correct handling of surrogate
    pairs. The background is as follows; the Postgres database server[1]
    currently assumes that the SQL it's receiving is in some user specified
    encoding, and it's been proposed that it would be nicer to be able to
    enter Unicode characters directly in the form of escape codes in a
    similar form to Python, i.e. support would be added for:


    The currently proposed patch[2] specifically handles surrogate pairs
    in the input. For example '\uD800\uDF02' and '\U00010302' would be
    considered to be valid and identical strings containing exactly one
    character. I was wondering if this should indeed be considered valid or
    if an error should be returned instead.

    As long as there are pairs of the surrogate code points provided as escape sequences, there's an unambiguous relation between each pair and a code point in the supplementary planes. So far, so good.
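The relation between a surrogate pair and its supplementary code point is the standard UTF-16 arithmetic, sketched here in Python. The function names are illustrative.

```python
def combine_surrogates(high, low):
    # Map a high/low surrogate pair to its supplementary code point.
    if not 0xD800 <= high <= 0xDBFF:
        raise ValueError("not a high surrogate: %04X" % high)
    if not 0xDC00 <= low <= 0xDFFF:
        raise ValueError("not a low surrogate: %04X" % low)
    return 0x10000 + ((high - 0xD800) << 10) + (low - 0xDC00)

def split_code_point(cp):
    # Inverse mapping: supplementary code point to (high, low).
    if not 0x10000 <= cp <= 0x10FFFF:
        raise ValueError("not a supplementary code point: %X" % cp)
    offset = cp - 0x10000
    return 0xD800 + (offset >> 10), 0xDC00 + (offset & 0x3FF)

print(hex(combine_surrogates(0xD800, 0xDF02)))  # 0x10302
print([hex(u) for u in split_code_point(0x1D41A)])  # ['0xd835', '0xdc1a']
```

The mapping is one-to-one in both directions, which is why '\uD800\uDF02' and '\U00010302' can be treated as the same string.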

    The upside is that the dual escape sequences facilitate conversion to/from UTF-16. Each code unit in UTF-16 can be processed separately.

    The downside is that you now have two equivalent escape mechanisms, and you can no longer do a binary comparison of strings containing escape sequences without first bringing them into a canonical form.

    However, if one is allowed to represent the character "a" both as 'a' and as '\u0061' (which I assume is possible) then there's already a certain ambiguity built into the escape sequence mechanism.

    What should definitely result in an error is to write '\U0000D800', because the 8-digit form is to be understood as UTF-32, and a surrogate code point is not valid in that context.

    So, in short, if the definition of the escapes is as follows

      '\uxxxx' - escape sequence for a UTF-16 code unit

      '\Uxxxxxxxx' - escape sequence for a UTF-32 code point

    then everything is fine and predictable. If the definition of the shorter sequence is instead "a code point on the BMP", then it's not clear how to handle surrogate pairs.
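A sketch of a decoder that follows the stricter definition above: the short form is read as a UTF-16 code unit, so paired surrogates are accepted, while the long form is read as UTF-32, so a surrogate value is an error. The name `decode_strict` and the error messages are assumptions made for illustration; malformed escapes such as a \u followed by fewer than four hex digits are simply passed through as literal text.

```python
import re

# One token per match: \uXXXX, \UXXXXXXXX, or any single literal character.
_TOKEN = re.compile(r"\\u([0-9A-Fa-f]{4})|\\U([0-9A-Fa-f]{8})|(.)", re.DOTALL)

def decode_strict(text):
    out = []
    high = None  # pending high surrogate from a \u escape, if any
    for match in _TOKEN.finditer(text):
        u16, u32, literal = match.groups()
        unit = int(u16, 16) if u16 is not None else None
        is_low = unit is not None and 0xDC00 <= unit <= 0xDFFF
        if high is not None and not is_low:
            raise ValueError("high surrogate not followed by low surrogate")
        if unit is not None:
            if 0xD800 <= unit <= 0xDBFF:
                high = unit  # wait for the matching low surrogate
            elif is_low:
                if high is None:
                    raise ValueError("low surrogate without high surrogate")
                out.append(chr(0x10000 + ((high - 0xD800) << 10) + (unit - 0xDC00)))
                high = None
            else:
                out.append(chr(unit))
        elif u32 is not None:
            cp = int(u32, 16)
            if cp > 0x10FFFF or 0xD800 <= cp <= 0xDFFF:
                raise ValueError("invalid UTF-32 value in \\U escape: %08X" % cp)
            out.append(chr(cp))
        else:
            out.append(literal)
    if high is not None:
        raise ValueError("high surrogate at end of input")
    return "".join(out)

print(decode_strict(r"\uD800\uDF02") == decode_strict(r"\U00010302"))  # True
```

With this rule `decode_strict(r"\U0000D800")` raises `ValueError`, as does a lone `\uD800`.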

