Re: Handling of Surrogates

From: Asmus Freytag (asmusf@ix.netcom.com)
Date: Thu Apr 16 2009 - 15:04:30 CDT


    On 4/16/2009 12:04 PM, Sam Mason wrote:
    > Hi All,
    >
    > I've got myself in a discussion about the correct handling of surrogate
    > pairs. The background is as follows; the Postgres database server[1]
    > currently assumes that the SQL it's receiving is in some user specified
    > encoding, and it's been proposed that it would be nicer to be able to
    > enter Unicode characters directly in the form of escape codes in a
    > similar form to Python, i.e. support would be added for:
    >
    > '\uxxxx'
    > and
    > '\Uxxxxxxxx'
    >
    > The currently proposed patch[2] specifically handles surrogate pairs
    > in the input. For example '\uD800\uDF02' and '\U00010302' would be
    > considered to be valid and identical strings containing exactly one
    > character. I was wondering if this should indeed be considered valid or
    > if an error should be returned instead.
    >
    >
    As long as there are pairs of the surrogate code points provided as
    escape sequences, there's an unambiguous relation between each pair and
    a code point in the supplementary planes. So far, so good.
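
    As a rough illustration of that relation (a sketch only, not taken
    from the proposed patch; the function name combine_surrogates is
    made up for this example), the mapping from a pair to its code point
    could be written in Python like this:

```python
def combine_surrogates(high: int, low: int) -> int:
    """Map a UTF-16 surrogate pair to its supplementary-plane code point.

    Hypothetical helper for illustration only.
    """
    if not 0xD800 <= high <= 0xDBFF:
        raise ValueError(f"not a high surrogate: {high:#06x}")
    if not 0xDC00 <= low <= 0xDFFF:
        raise ValueError(f"not a low surrogate: {low:#06x}")
    # Each surrogate carries 10 bits; the pair is offset by 0x10000.
    return 0x10000 + ((high - 0xD800) << 10) + (low - 0xDC00)


# The pair from the example above maps to U+10302.
print(hex(combine_surrogates(0xD800, 0xDF02)))  # 0x10302
```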

    The upside is that the dual escape sequences facilitate conversion
    to/from UTF-16. Each code unit in UTF-16 can be processed separately.

    The downside is that you now have two equivalent escape mechanisms, and
    you can no longer compare two strings containing escape sequences byte
    for byte without first bringing them into a canonical form.

    However, if one is allowed to represent the character "a" both as 'a'
    and as '\u0061' (which I assume is possible) then there's already a
    certain ambiguity built into the escape sequence mechanism.

    What should definitely result in an error is writing '\U0000D800',
    because the 8-digit form is to be understood as UTF-32, and a
    surrogate code point on its own is not valid there.
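
    A minimal sketch of such a check, assuming the eight hex digits have
    already been extracted from the escape (the name check_utf32_escape
    is hypothetical, and this is not how the patch necessarily does it):

```python
import string


def check_utf32_escape(hex_digits: str) -> int:
    """Validate the digits of a '\\Uxxxxxxxx' escape and return the code point.

    Hypothetical helper: rejects surrogates and values above U+10FFFF.
    """
    if len(hex_digits) != 8 or any(c not in string.hexdigits for c in hex_digits):
        raise ValueError("expected exactly 8 hexadecimal digits")
    value = int(hex_digits, 16)
    if value > 0x10FFFF:
        raise ValueError("beyond the Unicode code space")
    if 0xD800 <= value <= 0xDFFF:
        # Surrogates are only meaningful as UTF-16 code units.
        raise ValueError("surrogate code point not allowed in the 8-digit form")
    return value
```

    With this, "00010302" is accepted and "0000D800" raises an error.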

    So, in short, if the definition of the escapes is as follows:

        '\uxxxx' - escape sequence for a UTF-16 code unit

        '\Uxxxxxxxx' - escape sequence for a UTF-32 code unit, i.e. a
        single code point

    then everything is fine and predictable. If the shorter sequence is
    instead defined as "a code point on the BMP", then it's not clear how
    to handle surrogate pairs.
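
    To make the first definition concrete, here is a small Python sketch
    of a decoder that treats the 4-digit form as a UTF-16 code unit and
    the 8-digit form as UTF-32. The name decode_escapes is invented for
    this example, and it simplifies matters: malformed escapes are left
    as literal text and escaped backslashes are not handled.

```python
import re

# 4-digit form in group 1, 8-digit form in group 2.
_ESCAPE = re.compile(r"\\u([0-9A-Fa-f]{4})|\\U([0-9A-Fa-f]{8})")


def decode_escapes(text: str) -> str:
    """Decode escapes into a canonical string (hypothetical sketch)."""
    out = []
    pos = 0
    pending_high = None  # high surrogate waiting for its low partner
    for m in _ESCAPE.finditer(text):
        literal = text[pos:m.start()]
        if literal and pending_high is not None:
            raise ValueError("unpaired high surrogate")
        out.append(literal)
        pos = m.end()
        if m.group(1) is not None:  # '\uxxxx': a UTF-16 code unit
            unit = int(m.group(1), 16)
            if pending_high is not None:
                if not 0xDC00 <= unit <= 0xDFFF:
                    raise ValueError("high surrogate not followed by low surrogate")
                out.append(chr(0x10000 + ((pending_high - 0xD800) << 10)
                               + (unit - 0xDC00)))
                pending_high = None
            elif 0xD800 <= unit <= 0xDBFF:
                pending_high = unit
            elif 0xDC00 <= unit <= 0xDFFF:
                raise ValueError("unpaired low surrogate")
            else:
                out.append(chr(unit))
        else:  # '\Uxxxxxxxx': UTF-32, so surrogates are an error
            if pending_high is not None:
                raise ValueError("unpaired high surrogate")
            value = int(m.group(2), 16)
            if value > 0x10FFFF or 0xD800 <= value <= 0xDFFF:
                raise ValueError("invalid code point in 8-digit escape")
            out.append(chr(value))
    if pending_high is not None:
        raise ValueError("unpaired high surrogate")
    out.append(text[pos:])
    return "".join(out)


# Both spellings decode to the same one-character string.
assert decode_escapes(r"\uD800\uDF02") == decode_escapes(r"\U00010302")
```

    Comparing the decoded results, rather than the raw text, is one way
    to get the canonical form mentioned earlier; it also makes 'a' and
    '\u0061' compare equal.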

    A./


