RE: Handling of Surrogates

From: Philippe Verdy (verdy_p@wanadoo.fr)
Date: Fri Apr 17 2009 - 04:02:16 CDT

    Doug Ewell wrote:
    > I have to agree with Asmus on this. Even if the \Uxxxxxxxx
    > notation was originally created to get around the
    > four-hex-digit limit of \uxxxx, it does imply a 32-bit value.
    > Writing \U0000D835\U0000DC1A would strongly imply that two
    > characters are being represented, not one. With this
    > extended notation, there should be no reason to fall back to
    > UTF-16 code units.

    I also agree, except that I would have used the term "two assigned code
    points" instead of "two characters". Here the surrogates are not
    characters. The real question is whether they should be accepted. My
    opinion is that "\U0000D835\U0000DC1A" should be rejected, if only because
    each of the two 32-bit values is outside the valid ranges for characters.

    For the same reason, the syntax "\uDC1A" should be rejected if it is not
    paired with a matching surrogate (as a low surrogate, it must immediately
    follow a high surrogate).

    In other words, the syntax should be used only to create VALID sequences of
    UTF-16 code units or VALID sequences of UTF-32 code units, so that there is
    no ambiguity in their conversion into sequences of code points, and back to
    code units of any kind.
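
    As a rough illustration of the rule above, here is a minimal sketch of a
    strict decoder, assuming a grammar in which \uXXXX denotes one UTF-16
    code unit and \UXXXXXXXX denotes one UTF-32 code unit. The name
    decode_escapes and the error messages are hypothetical, and any
    backslash that does not start one of these two escapes is simply kept
    as a literal character. This is not the behaviour of any particular SQL
    engine.

```python
import re

# Assumed grammar: \uXXXX is one UTF-16 code unit, \UXXXXXXXX is one UTF-32 code unit.
_ESCAPE = re.compile(r"\\u([0-9A-Fa-f]{4})|\\U([0-9A-Fa-f]{8})")


def _is_high(value):
    return 0xD800 <= value <= 0xDBFF


def _is_low(value):
    return 0xDC00 <= value <= 0xDFFF


def decode_escapes(text):
    """Return the code points denoted by text, or raise ValueError."""
    code_points = []
    pending_high = None  # high surrogate still waiting for its low surrogate
    pos = 0
    while pos < len(text):
        match = _ESCAPE.match(text, pos)
        if match and match.group(1) is not None:
            # \uXXXX: a UTF-16 code unit, possibly half of a surrogate pair.
            unit = int(match.group(1), 16)
            if pending_high is not None:
                if not _is_low(unit):
                    raise ValueError("high surrogate not followed by a low surrogate")
                code_points.append(
                    0x10000 + ((pending_high - 0xD800) << 10) + (unit - 0xDC00)
                )
                pending_high = None
            elif _is_high(unit):
                pending_high = unit
            elif _is_low(unit):
                raise ValueError("low surrogate without a preceding high surrogate")
            else:
                code_points.append(unit)
            pos = match.end()
            continue
        if pending_high is not None:
            # Only a \uXXXX low surrogate may complete a pending high surrogate.
            raise ValueError("high surrogate not followed by a low surrogate")
        if match:
            # \UXXXXXXXX: must already be a valid non-surrogate code point.
            value = int(match.group(2), 16)
            if value > 0x10FFFF or 0xD800 <= value <= 0xDFFF:
                raise ValueError("\\U escape is not a valid character code point")
            code_points.append(value)
            pos = match.end()
        else:
            code_points.append(ord(text[pos]))
            pos += 1
    if pending_high is not None:
        raise ValueError("high surrogate at end of input")
    return code_points
```

    Under these assumptions, r"\uD835\uDC1A" and r"\U0001D41A" both decode
    to the single code point U+1D41A, while r"\U0000D835\U0000DC1A" and a
    lone r"\uDC1A" raise ValueError.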

    Also, I see absolutely no "waste of space" in using them: the database will
    not actually use these syntaxes within its storage tables. The syntaxes are
    only for string constants that occur within SQL queries. They don't
    affect the data stored in input variables bound to SQL query placeholders,
    or in columns of SELECT result sets, unless that data is converted and
    displayed this way by the SQL client itself.
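
    The point about bound variables can be sketched with Python's built-in
    sqlite3 module, used here only as a convenient stand-in and not as the
    database discussed in this thread. SQLite has no such escape syntax in
    its string literals, so the sketch shows only the bound-parameter side:
    values passed through placeholders are stored and returned as data,
    whether they hold the actual character or the literal text of an
    escape. The table name t and column name s are arbitrary.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t (s TEXT)")

values = [
    "\U0001D41A",   # one actual character, U+1D41A (the escape is Python's, not SQL's)
    r"\U0001D41A",  # ten literal characters: a backslash, 'U', and eight hex digits
]

# Bound parameters are passed as data; no escape processing is applied to them.
conn.executemany("INSERT INTO t (s) VALUES (?)", [(v,) for v in values])

stored = [row[0] for row in conn.execute("SELECT s FROM t ORDER BY rowid")]
conn.close()
```

    Here stored compares equal to values: the first entry is still a single
    character and the second is still ten characters of plain text.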


