Re: Handling of Surrogates

From: Sam Mason (sam@samason.me.uk)
Date: Fri Apr 17 2009 - 06:29:38 CDT

Next message: Sam Mason: "Re: Handling of Surrogates"

Previous message: António MARTINS-Tuválkin: "Dal and sad with 3 dots below"
Maybe in reply to: Sam Mason: "Handling of Surrogates"

On Thu, Apr 16, 2009 at 01:04:07PM -0700, Edward Cherlin wrote:
> Which encoding do they propose to specify? UTF-16 with surrogates?
> UCS-4 in 32 bits? Will this allow both? It makes a difference.

As far as I know the idea is to stay away from specifying things like
this as much as is possible. When dealing with Unicode text we use
UTF-8 both internally and externally at the moment.

> (Clearly not UCS-2, which lacks surrogates)

No, PG's never had much to do with UCS-2 as far as I know.

My personal feeling is that we may be dealing with UTF-16 clients (Java,
C# and some builds of Python being the ones that spring to mind) who
may have trouble generating properly escaped literals. Hence it's
probably prudent to support whatever is possible, without introducing
the opportunity for ambiguity.
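
To illustrate the concern, here is a rough Python sketch (my own, not
taken from the patch) of how a client that thinks in UTF-16 code units
might end up writing a character above U+FFFF as two escapes. The
names to_utf16_units and naive_escape are made up for this example,
and the \XXXX form is only meant as a stand-in for whatever escape
syntax ends up being used.

```python
def to_utf16_units(cp):
    """Split one Unicode code point into its UTF-16 code units."""
    if not 0 <= cp <= 0x10FFFF or 0xD800 <= cp <= 0xDFFF:
        raise ValueError("not a Unicode scalar value: %#x" % cp)
    if cp < 0x10000:
        # Characters in the BMP are a single code unit.
        return [cp]
    # Everything else is split into a high and a low surrogate.
    v = cp - 0x10000
    return [0xD800 + (v >> 10), 0xDC00 + (v & 0x3FF)]


def naive_escape(text):
    """Hypothetical client-side escaper: one \\XXXX escape per code unit."""
    return "".join(
        "\\%04X" % unit
        for ch in text
        for unit in to_utf16_units(ord(ch))
    )


# U+1D11E (MUSICAL SYMBOL G CLEF) comes out as a surrogate pair.
print(naive_escape("a\U0001D11E"))
```

A server that accepts such a pair and folds it back into one character
would let clients like this work without extra effort on their side.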

> I trust that you would return an error for unmatched or mismatched
> surrogates.

Yes, the current patch does this.

> Surrogate pairs are part of UTF-16 but not UCS-4.

Yes, this was my reason for posting here. I even managed to get
myself confused and say that the FAQs on the Unicode site decreed that
surrogates aren't valid in interchange.

> If this is an input
> conversion option only, and the characters are stored as the resulting
> 32-bit single UCS-4 character codes, fine. Just don't store surrogates
> in the database, please, unless it is very explicitly declared to be
> UTF-16, and extremely well documented. Basically, the No Surprises
> rule.

Yes, the current proposal is to normalise back to a single character.
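
For what it's worth, the behaviour I have in mind looks roughly like
the following Python sketch. It is only an illustration under my own
assumptions, not the code in the patch, and combine_surrogates is a
name I made up. It takes escaped values as a list of integers, folds
each high/low surrogate pair into one code point, and raises an error
for anything unmatched.

```python
def combine_surrogates(units):
    """Fold surrogate pairs into single code points; reject stray ones."""
    out = []
    i = 0
    while i < len(units):
        u = units[i]
        if 0xD800 <= u <= 0xDBFF:
            # A high surrogate must be followed directly by a low one.
            if i + 1 >= len(units) or not 0xDC00 <= units[i + 1] <= 0xDFFF:
                raise ValueError("unmatched high surrogate %04X" % u)
            low = units[i + 1]
            out.append(0x10000 + ((u - 0xD800) << 10) + (low - 0xDC00))
            i += 2
        elif 0xDC00 <= u <= 0xDFFF:
            # A low surrogate with no preceding high surrogate.
            raise ValueError("unmatched low surrogate %04X" % u)
        else:
            out.append(u)
            i += 1
    return out


code_points = combine_surrogates([0x61, 0xD834, 0xDD1E])
text = "".join(chr(cp) for cp in code_points)
# The pair is stored as one character, four bytes in UTF-8.
print(text.encode("utf-8"))
```

So nothing resembling a surrogate would ever reach the stored UTF-8.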

> Is there any thought of standardizing treatment of Unicode handling in
> SQL? I know that SQL standards have little traction in the market, but
> perhaps this could be an exception.

Sorry; no idea. I'm mainly just a user of Postgres who tends to get in
over my head in technical discussions. The original thing that prompted
the current proposal was that one of the core developers noticed[1] that
the SQL-2003 spec defines a method of escaping Unicode characters in
strings.

-- 
  Sam  http://samason.me.uk/
 
 [1] http://archives.postgresql.org/pgsql-hackers/2009-04/msg00601.php
