Re: Handling of Surrogates

From: Sam Mason (
Date: Fri Apr 17 2009 - 06:55:05 CDT

  • Next message: Sam Mason: "Re: Handling of Surrogates"

    On Fri, Apr 17, 2009 at 12:01:51AM +0200, Philippe Verdy wrote:
    > The good question to ask is wht will be the semantic of table definitions
    > that take count of "characters" in terms of restriction of field length, or
    > in functions manipulating substrings or performing various operations on
    > them. The database could still perform a validation for the proposed syntax,
    > so that it will not accept to store unpaired surrogates (this would generate
    > a constraint validation error, just like the one that occurs when you
    > violate a length restriction.

    PG currently does things with "characters" and not bytes as other
    databases tend to do. For example; at the moment I can do:

      CREATE TABLE foo ( c varchar(1) );

      INSERT INTO foo VALUES (E'\xF0\x90\x8C\x82');

    The "E''" is about saying that we're being non-standard and wanting to
    use C style escapes, which SQL doesn't have. If I then run:

      SELECT length(c) FROM foo;

    I get back "1", saying there's a single character.

    > The actual number of bytes taken in the database storage is independant of
    > the SQL definition of tables (if you use a syntax like "VARCHAR(n)" or
    > "CHAR(n)", but many RDBMS engines (Oracle, Sybase, Informix, Microsoft SQL
    > Server) have chosen to define a new type (like "[VAR]NLSCHAR(n)") for
    > allowing storing sequences of atomic Unicode codepoints rather than sequence
    > of code units (with [VAR]CHAR(n)). Some engines do not use a separate type,
    > but allow the database to be created with a global parameter specifying the
    > encoding used for CHAR, CHAR(n), VARCHAR and VARCHAR(n).

    PG doesn't force you to do anything "special" to handle this.

    > Whever you run a SQL query specifying:
    > SELECT ... FROM... WHERE field='a'
    > or
    > SELECT ... FROM... WHERE field='\x61'
    > or
    > SELECT ... FROM... WHERE field='\u0061'
    > or
    > SELECT ... FROM... WHERE field='\U00000061'
    > the queries should be run equivalently in all cases, as long as the
    > character is representable in the target database table storage. The
    > behavior should be exactly similar to what has been adopted in HTML.

    Yes, that's as I understand it would all work.

    > Handling the lagacy '\xNN' syntax is generally more complex because it is
    > generally dependant of the encoding used by the SQL client (within its
    > session parameters at logon time or in subsequently set locale parameters)
    > when communicating with the SQL server. But '\u' and '\U' syntaxes must be
    > made independant of these encoding and client locale preferences.

    Yes, we have a "client_encoding" parameter (separate from the database's
    encoding) that's used when talking to the client. \u and \U are
    currently only proposed to work when the server's encoding is set to
    UTF-8. PG already knows how to translate between the various server
    encodings and client encodings automatically.

    > One difficulty will be to specify the behavior of the RDMS engine if the
    > client connects to it using a client-side encoding that does not support
    > Unicode: should it be allowed to connet to the database if this is a global
    > setting or perform queries returning data columns containing Unicoded
    > characters of the Unicode capability is specified per table or per column?

    I think I've answered this above; is this resolved?

    > I don't see why this should be a new difficulty in Postgres, when it has
    > already been solved since long in Oracle, Sybase, MS SQL, DB2 and more
    > recently in MySQL too...

    As always, the devil is in the details. UTF-8 has been supported in
    PG for a long time; the issue is about supporting SQL-2003 features
    that (as far as I can determine) only DB2 supports, MySQL has a TODO
    item for the feature posted for version 7.1. PG got support for it in
    October last year, but Tom recently realised it opened the database up
    to various attacks and was thinking about ripping it out again until
    these could be resolved. We think these issues are now resolved, but
    other questions have been opened and hence my queries here.


    This archive was generated by hypermail 2.1.5 : Fri Apr 17 2009 - 06:56:34 CDT