RE: Handling of Surrogates

From: Philippe Verdy (verdy_p@wanadoo.fr)
Date: Fri Apr 17 2009 - 04:37:19 CDT

    Bjoern Hoehrmann wrote:
    > >
    > > '\uxxxx'
    > >and
    > > '\Uxxxxxxxx'
    >
    > I think you would be better off doing it similar to Perl,
    > which uses explicit delimiters for the value. This has a
    > number of benefits: you can parse it case-insensitively,
    > there is no confusion if you, for example, want U+AFFE
    > followed by the literal "AFFE", there is no confusion as to
    > what the required length is (some formats allow only six
    > digits), it is extendable (with Perl you can also use
    > character names, alias names, etc.), and the answer to your
    > question is more obvious. Perl generates a warning if you
    > specify surrogate code points.

    You have not understood the issue. The proposed \u and \U syntaxes make a
    clear distinction: each requires a fixed number of hex digits after it
    (4 and 8 respectively), regardless of the code point value. Both accept
    hex digits in lowercase or uppercase without any problem or ambiguity. So
    there is no confusion if they are followed by the literal "AFFE".
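
    A minimal sketch of that fixed-width rule in Python. The helper name
    split_escapes is hypothetical, and the sketch assumes a simplified
    grammar that ignores escaped backslashes and every other escape:

```python
import re

# Assumption: \u takes exactly 4 hex digits and \U exactly 8, in either
# letter case. The pattern itself is case-sensitive, so \u and \U stay distinct.
_ESCAPE = re.compile(r"\\u([0-9A-Fa-f]{4})|\\U([0-9A-Fa-f]{8})")


def split_escapes(text):
    """Split text into ints (escaped values) and strs (literal runs)."""
    parts = []
    pos = 0
    for m in _ESCAPE.finditer(text):
        if m.start() > pos:
            parts.append(text[pos:m.start()])  # literal run before the escape
        parts.append(int(m.group(1) or m.group(2), 16))
        pos = m.end()
    if pos < len(text):
        parts.append(text[pos:])  # trailing literal run
    return parts
```

    Because the digit count is fixed, r"\uAFFEAFFE" splits into the value
    0xAFFE followed by the four literal characters "AFFE".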

    The issues discussed here lie elsewhere: in the use of surrogates, in the
    validity of the sequences of code units that these syntaxes represent, and
    in the number of valid code points that they represent. For me, surrogates
    are not valid code points for storage purposes, even if these code points
    are allocated in the standard (as surrogate code points, not as characters).

    But it's true that the \u and \U syntaxes are quite confusing. And the
    delimited approach, as in HTML '&#xNNNNN;' or in '\x{NNNN}', has its
    merits: you don't have to represent code units, only code points, and
    there's no need to enforce the number of hex digits.

    That works as long as the hex values fall into the correct ranges for valid
    code points: U+0000..U+D7FF or U+E000..U+10FFFF (i.e. exactly the ranges
    accepted by strict UTF-8 and by the other compliant UTFs, including
    compressed ones like BOCU and SCSU).
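
    The range check for a delimited escape could be sketched as follows in
    Python. The names is_valid_code_point and parse_delimited are
    hypothetical, and only the '\x{...}' form is handled:

```python
import re

# Assumption: a delimited escape is \x{ + one or more hex digits + }.
_DELIMITED = re.compile(r"\\x\{([0-9A-Fa-f]+)\}")


def is_valid_code_point(value):
    """True for U+0000..U+D7FF and U+E000..U+10FFFF (no surrogates)."""
    return 0x0000 <= value <= 0xD7FF or 0xE000 <= value <= 0x10FFFF


def parse_delimited(escape):
    """Return the code point named by a \\x{...} escape, or raise ValueError."""
    m = _DELIMITED.fullmatch(escape)
    if m is None:
        raise ValueError("not a delimited escape: %r" % escape)
    value = int(m.group(1), 16)  # any number of hex digits is allowed
    if not is_valid_code_point(value):
        raise ValueError("U+%04X is not valid for storage" % value)
    return value
```

    With this check, r"\x{1D41A}" and r"\x{0001D41A}" give the same code
    point, while r"\x{D800}" and r"\x{110000}" are rejected.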

    So the proposed syntaxes should enforce exactly the same validity
    constraints on the represented code points, independently of the code units
    indirectly referenced (only locally within strings) by the syntax.

    For this reason, if I suppose that I have a table containing a CHAR(1)
    column with Unicode capability and I write:
    INSERT INTO mytable VALUES('\uD800', ...)
    It should fail immediately because there's no way to map it to a single
    character.

    Conversely,
    INSERT INTO mytable VALUES('\uD835\uDC1A', ...)
    or
    INSERT INTO mytable VALUES('\U0001D41A', ...)
    should be accepted, as each of them represents a single character
    (U+1D41A).
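
    The accept/reject behaviour in these INSERT examples can be sketched in
    Python. The function combine_units is hypothetical; it takes the numeric
    values of the escapes in a string literal, and it assumes (as a
    simplification) that any high/low surrogate values in sequence are
    meant as a UTF-16 pair:

```python
def combine_units(values):
    """Turn escaped values into code points, pairing UTF-16 surrogates.

    Raises ValueError for a lone surrogate or an out-of-range value.
    """
    out = []
    i = 0
    while i < len(values):
        v = values[i]
        if 0xD800 <= v <= 0xDBFF:  # high surrogate: needs a low one next
            if i + 1 < len(values) and 0xDC00 <= values[i + 1] <= 0xDFFF:
                low = values[i + 1]
                out.append(0x10000 + ((v - 0xD800) << 10) + (low - 0xDC00))
                i += 2
                continue
            raise ValueError("unpaired high surrogate U+%04X" % v)
        if 0xDC00 <= v <= 0xDFFF:  # low surrogate without a preceding high
            raise ValueError("unpaired low surrogate U+%04X" % v)
        if not 0 <= v <= 0x10FFFF:
            raise ValueError("value out of range: %r" % v)
        out.append(v)
        i += 1
    return out
```

    Here combine_units([0xD835, 0xDC1A]) and combine_units([0x1D41A]) both
    yield the single code point U+1D41A, which fits a CHAR(1) column, while
    combine_units([0xD800]) raises ValueError.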
