Re: Handling of Surrogates

From: Mark Davis (mark.edward.davis@gmail.com)
Date: Thu Apr 16 2009 - 16:55:39 CDT

    I disagree somewhat, if I understand what you wrote. When the \u and \U
    conventions are used:

    U+0061 <http://unicode.org/cldr/utility/character.jsp?a=0061> ( a ) LATIN
    SMALL LETTER A could be represented as any of:

       1. 'a'
       2. \u0061
       3. \U00000061

    The use of #3 is a waste of space, but should not be illegal (except where \U
    is not available). E.g.,
    http://unicode.org/cldr/utility/list-unicodeset.jsp?a=\U00000061

    U+1D41A <http://unicode.org/cldr/utility/character.jsp?a=1D41A> ( 𝐚 )
    MATHEMATICAL BOLD SMALL A could be represented as any of:

       1. '𝐚'
       2. \uD835\uDC1A
       3. \U0000D835\U0000DC1A
       4. \U0001D41A

    Similarly, #3 is a waste of space but should not be illegal. #2 and #3 are
    discouraged where \U is available or UTF-16 is not used, but #2 is necessary
    where \U is not available (e.g., Java). [Myself, I like \x{...} escaping
    better, since it is more uniform. Having a terminator allows variable
    length.]
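
    A minimal sketch of the lenient reading above, in Python, assuming an
    unescaper that accepts every form listed and joins surrogate pairs
    whether they arrive as \u or \U escapes. The name decode_lenient and the
    choice to reject unpaired surrogates are assumptions made for
    illustration; they are not taken from the Postgres patch.

```python
import re

# Matches \UXXXXXXXX (8 hex digits) or \uXXXX (4 hex digits).
_ESCAPE = re.compile(r"\\U([0-9A-Fa-f]{8})|\\u([0-9A-Fa-f]{4})")


def _combine_surrogates(values):
    # Merge each high/low surrogate pair into one supplementary code point.
    out = []
    for v in values:
        if out and 0xD800 <= out[-1] <= 0xDBFF and 0xDC00 <= v <= 0xDFFF:
            high = out.pop()
            out.append(0x10000 + ((high - 0xD800) << 10) + (v - 0xDC00))
        else:
            out.append(v)
    return out


def decode_lenient(text):
    # Hypothetical helper: decode both escape forms, pairing surrogates
    # regardless of which escape form they were written in.
    values = []
    pos = 0
    for m in _ESCAPE.finditer(text):
        values.extend(ord(ch) for ch in text[pos:m.start()])
        values.append(int(m.group(1) or m.group(2), 16))
        pos = m.end()
    values.extend(ord(ch) for ch in text[pos:])

    combined = _combine_surrogates(values)
    for v in combined:
        if 0xD800 <= v <= 0xDFFF:
            raise ValueError("unpaired surrogate U+%04X" % v)
        if v > 0x10FFFF:
            raise ValueError("value beyond U+10FFFF: %X" % v)
    return "".join(chr(v) for v in combined)
```

    Under these assumptions, the four spellings of U+1D41A listed above all
    decode to the same one-character string, as do the three spellings of
    U+0061.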

    Mark

    On Thu, Apr 16, 2009 at 13:04, Asmus Freytag <asmusf@ix.netcom.com> wrote:

    > On 4/16/2009 12:04 PM, Sam Mason wrote:
    >
    >> Hi All,
    >>
    >> I've got myself in a discussion about the correct handling of surrogate
    >> pairs. The background is as follows: the Postgres database server[1]
    >> currently assumes that the SQL it's receiving is in some user specified
    >> encoding, and it's been proposed that it would be nicer to be able to
    >> enter Unicode characters directly in the form of escape codes in a
    >> similar form to Python, i.e. support would be added for:
    >>
    >> '\uxxxx'
    >> and
    >> '\Uxxxxxxxx'
    >>
    >> The currently proposed patch[2] specifically handles surrogate pairs
    >> in the input. For example '\uD800\uDF02' and '\U00010302' would be
    >> considered to be valid and identical strings containing exactly one
    >> character. I was wondering if this should indeed be considered valid or
    >> if an error should be returned instead.
    >>
    >>
    >>
    > As long as there are pairs of the surrogate code points provided as escape
    > sequences, there's an unambiguous relation between each pair and a code
    > point in the supplementary planes. So far, so good.
    >
    > The upside is that the dual escape sequences facilitate conversion to/from
    > UTF-16. Each code unit in UTF-16 can be processed separately.
    >
    > The downside is that you now have two equivalent escape mechanisms, and you
    > can no longer take a string with escape sequences and binarily compare it
    > without bringing it into a canonical form.
    >
    > However, if one is allowed to represent the character "a" both as 'a' and
    > as '\u0061' (which I assume is possible) then there's already a certain
    > ambiguity built into the escape sequence mechanism.
    >
    > What should definitely result in an error is to write '\U0000D800' because
    > the 8-digit form is to be understood as UTF-32, and in that context a
    > surrogate code point is not a valid value.
    >
    > So, in short, if the definition of the escapes is as follows
    >
    > '\uxxxx' - escape sequence for a UTF-16 code unit
    >
    > '\Uxxxxxxxx' - escape sequence for a UTF-32 code point
    >
    > then everything is fine and predictable. If the definition of the shorter
    > sequence is instead "a code point on the BMP", then it's not clear how to
    > handle surrogate pairs.
    >
    > A./
    >
    >
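
    For comparison, the stricter definition quoted above (\u as a UTF-16
    code unit, \U as a UTF-32 code point) might be sketched as follows. The
    name decode_strict and the exact error conditions are assumptions for
    illustration: surrogates are accepted only as a properly ordered pair of
    \u escapes, and any \U escape naming a surrogate is an error.

```python
import re

# Matches \UXXXXXXXX (8 hex digits) or \uXXXX (4 hex digits).
_ESCAPE = re.compile(r"\\U([0-9A-Fa-f]{8})|\\u([0-9A-Fa-f]{4})")


def decode_strict(text):
    # Hypothetical helper: \u is a UTF-16 code unit, \U is a UTF-32 code point.
    out = []
    pending_high = None  # high surrogate from the preceding \u escape, if any
    pos = 0
    for m in _ESCAPE.finditer(text):
        literal = text[pos:m.start()]
        if literal and pending_high is not None:
            raise ValueError("unpaired high surrogate")
        out.append(literal)
        pos = m.end()

        if m.group(1) is not None:
            # \U form: must be a Unicode scalar value, never a surrogate.
            if pending_high is not None:
                raise ValueError("unpaired high surrogate")
            cp = int(m.group(1), 16)
            if 0xD800 <= cp <= 0xDFFF or cp > 0x10FFFF:
                raise ValueError("invalid UTF-32 value %08X" % cp)
            out.append(chr(cp))
            continue

        # \u form: one UTF-16 code unit, possibly half of a surrogate pair.
        unit = int(m.group(2), 16)
        if pending_high is not None:
            if not 0xDC00 <= unit <= 0xDFFF:
                raise ValueError("unpaired high surrogate")
            cp = 0x10000 + ((pending_high - 0xD800) << 10) + (unit - 0xDC00)
            out.append(chr(cp))
            pending_high = None
        elif 0xD800 <= unit <= 0xDBFF:
            pending_high = unit
        elif 0xDC00 <= unit <= 0xDFFF:
            raise ValueError("unpaired low surrogate")
        else:
            out.append(chr(unit))

    if pending_high is not None:
        raise ValueError("unpaired high surrogate")
    out.append(text[pos:])
    return "".join(out)
```

    With this reading, '\uD800\uDF02' and '\U00010302' still decode to the
    same single character, while '\U0000D800' and '\U0000D835\U0000DC1A' are
    rejected.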


