Re: Handling of Surrogates

From: Asmus Freytag (asmusf@ix.netcom.com)
Date: Thu Apr 16 2009 - 15:41:30 CDT

    On 4/16/2009 1:04 PM, Asmus Freytag wrote:
    > On 4/16/2009 12:04 PM, Sam Mason wrote:
    >> Hi All,
    >>
    >> I've got myself in a discussion about the correct handling of surrogate
    >> pairs. The background is as follows: the Postgres database server[1]
    >> currently assumes that the SQL it's receiving is in some user specified
    >> encoding, and it's been proposed that it would be nicer to be able to
    >> enter Unicode characters directly in the form of escape codes in a
    >> similar form to Python, i.e. support would be added for:
    >>
    >> '\uxxxx'
    >> and
    >> '\Uxxxxxxxx'
    >>
    >> The currently proposed patch[2] specifically handles surrogate pairs
    >> in the input. For example '\uD800\uDF02' and '\U00010302' would be
    >> considered to be valid and identical strings containing exactly one
    >> character. I was wondering if this should indeed be considered valid or
    >> if an error should be returned instead.
    >>
    >>
    > As long as there are pairs of the surrogate code points provided as
    > escape sequences, there's an unambiguous relation between each pair
    > and a code point in the supplementary planes. So far, so good.
    >
    > The upside is that the dual escape sequences facilitate conversion
    > to/from UTF-16. Each code unit in UTF-16 can be processed separately.
    >
    > The downside is that you now have two equivalent escape mechanisms,
    > and you can no longer compare two strings containing escape sequences
    > byte for byte without first bringing them into a canonical form.
    >
    > However, if one is allowed to represent the character "a" both as 'a'
    > and as '\u0061' (which I assume is possible) then there's already a
    > certain ambiguity built into the escape sequence mechanism.
    >
    > What should definitely result in an error is to write '\U0000D800'
    > because the 8-digit form is to be understood as UTF-32, and in that
    > context a lone surrogate value is not valid.
    >
    > So, in short, if the definition of the escapes is as follows
    >
    > '\uxxxx' - escape sequence for a UTF-16 code point
    >
    > '\Uxxxxxxxx' - escape sequence for a UTF-32 code point
    recte: code unit in both cases.
    >
    > then everything is fine and predictable. If the definition of the
    > shorter sequence is instead, "a code point on the BMP" then it's not
    > clear how to handle surrogate pairs.
    >
    > A./
    >
    >
    >
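
To make the two definitions above concrete, here is a minimal sketch in Python of a decoder that treats '\uxxxx' as a UTF-16 code unit and '\Uxxxxxxxx' as a UTF-32 code unit. The function name decode_escapes and its error messages are made up for illustration; this is not the proposed Postgres patch, and it assumes, as a simplification, that the input contains no escaped backslashes.

```python
import re

# Matches '\uXXXX' (4 hex digits) or '\UXXXXXXXX' (8 hex digits).
_ESCAPE = re.compile(r"\\u([0-9A-Fa-f]{4})|\\U([0-9A-Fa-f]{8})")


def _is_high(cu):
    return 0xD800 <= cu <= 0xDBFF


def _is_low(cu):
    return 0xDC00 <= cu <= 0xDFFF


def decode_escapes(text):
    """Hypothetical decoder: '\\u' is a UTF-16 code unit, '\\U' a UTF-32 one."""
    out = []
    pending_high = None  # high surrogate still waiting for its low partner
    pos = 0
    for m in _ESCAPE.finditer(text):
        literal = text[pos:m.start()]
        if literal:
            if pending_high is not None:
                raise ValueError("high surrogate not followed by low surrogate")
            out.append(literal)
        pos = m.end()
        if m.group(1) is not None:
            # UTF-16 code unit: surrogates are allowed, but only in pairs.
            cu = int(m.group(1), 16)
            if pending_high is not None:
                if not _is_low(cu):
                    raise ValueError("high surrogate not followed by low surrogate")
                cp = 0x10000 + ((pending_high - 0xD800) << 10) + (cu - 0xDC00)
                out.append(chr(cp))
                pending_high = None
            elif _is_high(cu):
                pending_high = cu
            elif _is_low(cu):
                raise ValueError("low surrogate without preceding high surrogate")
            else:
                out.append(chr(cu))
        else:
            # UTF-32 code unit: surrogate values are never valid here.
            if pending_high is not None:
                raise ValueError("high surrogate not followed by low surrogate")
            cp = int(m.group(2), 16)
            if cp > 0x10FFFF or 0xD800 <= cp <= 0xDFFF:
                raise ValueError("not a valid UTF-32 code unit: %08X" % cp)
            out.append(chr(cp))
    if pending_high is not None:
        raise ValueError("high surrogate not followed by low surrogate")
    out.append(text[pos:])
    return "".join(out)
```

Under these assumptions, r'\uD800\uDF02' and r'\U00010302' decode to the same one-character string, while r'\U0000D800' and a lone r'\uD800' are rejected. Comparing strings only after decoding is one way to obtain the canonical form mentioned above.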


