Re: Handling of Surrogates

From: Sam Mason (sam@samason.me.uk)
Date: Fri Apr 17 2009 - 06:29:38 CDT


    On Thu, Apr 16, 2009 at 01:04:07PM -0700, Edward Cherlin wrote:
    > Which encoding do they propose to specify? UTF-16 with surrogates?
    > UCS-4 in 32 bits? Will this allow both? It makes a difference.

    As far as I know the idea is to stay away from specifying things like
    this as much as is possible. When dealing with Unicode text we use
    UTF-8 both internally and externally at the moment.

    > (Clearly not UCS-2, which lacks surrogates)

    No, PG's never had much to do with UCS-2 as far as I know.

    My personal feeling is that we may be dealing with UTF-16 clients (Java,
    C# and some builds of Python being the ones that spring to mind) who
    may have trouble generating properly escaped literals. Hence it's
    probably prudent to support whatever is possible, without introducing
    the opportunity for ambiguity.

    > I trust that you would return an error for unmatched or mismatched
    > surrogates.

    Yes, the current patch does this.

    > Surrogate pairs are part of UTF-16 but not UCS-4.

    Yes, this was my reason for posting here. I even managed to get
    myself confused and say that the FAQs on the Unicode site decreed that
    surrogates aren't valid in interchange.

    > If this is an input
    > conversion option only, and the characters are stored as the resulting
    > 32-bit single UCS-4 character codes, fine. Just don't store surrogates
    > in the database, please, unless it is very explicitly declared to be
    > UTF-16, and extremely well documented. Basically, the No Surprises
    > rule.

    Yes, the current proposal is to normalise back to a single character.
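
    To illustrate what "normalise back to a single character" means, here
    is a rough Python sketch. It is not the actual patch; the function
    name combine_surrogates is made up for this example, and it assumes
    the escapes have already been parsed into 16-bit code units. It folds
    each high/low surrogate pair into one code point and raises an error
    for an unmatched or mismatched surrogate.

```python
def combine_surrogates(units):
    """Fold UTF-16 code units into a str of whole code points.

    Hypothetical helper for illustration only. Raises ValueError on an
    unmatched or mismatched surrogate.
    """
    out = []
    i = 0
    while i < len(units):
        u = units[i]
        if 0xD800 <= u <= 0xDBFF:
            # High (leading) surrogate: must be followed by a low surrogate.
            if i + 1 >= len(units) or not (0xDC00 <= units[i + 1] <= 0xDFFF):
                raise ValueError(f"unmatched high surrogate U+{u:04X}")
            low = units[i + 1]
            # Standard UTF-16 arithmetic: 10 bits from each half, plus 0x10000.
            out.append(chr(0x10000 + ((u - 0xD800) << 10) + (low - 0xDC00)))
            i += 2
        elif 0xDC00 <= u <= 0xDFFF:
            # Low (trailing) surrogate with no preceding high surrogate.
            raise ValueError(f"unmatched low surrogate U+{u:04X}")
        else:
            # Ordinary BMP code unit: already a complete code point.
            out.append(chr(u))
            i += 1
    return "".join(out)
```

    For example, the pair D834 DD1E comes back as the single character
    U+1D11E, so nothing surrogate-shaped would end up stored.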

    > Is there any thought of standardizing treatment of Unicode handling in
    > SQL? I know that SQL standards have little traction in the market, but
    > perhaps this could be an exception.

    Sorry; no idea. I'm mainly just a user of Postgres who tends to get in
    over my head in technical discussions. The original thing that prompted
    the current proposal was that one of the core developers noticed[1] that
    the SQL-2003 spec defines a method of escaping Unicode characters in
    strings.

    -- 
      Sam  http://samason.me.uk/
     
     [1] http://archives.postgresql.org/pgsql-hackers/2009-04/msg00601.php
    

