RE: Handling of Surrogates

From: Philippe Verdy (verdy_p@wanadoo.fr)
Date: Thu Apr 16 2009 - 17:01:51 CDT

Next message: Asmus Freytag: "Re: Handling of Surrogates"

Previous message: Mark Davis: "Re: Handling of Surrogates"
In reply to: Asmus Freytag: "Re: Handling of Surrogates"
Next in thread: Sam Mason: "Re: Handling of Surrogates"
Reply: Sam Mason: "Re: Handling of Surrogates"

I don't think that the proposed syntax means that the Unicode characters
need to be stored in the database using this form. It is just meant to be
used within the syntax of SQL requests themselves, and is not needed for
bound variables. The database can then store the encoded characters
directly, using any convenient binary UTF format.

The real question to ask is what the semantics will be for table definitions
that count "characters" when restricting field length, and for functions
that manipulate substrings or perform other operations on them. The database
could still validate the proposed syntax so that it refuses to store
unpaired surrogates (this would generate a constraint validation error, just
like the one that occurs when you violate a length restriction).
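
To illustrate why the meaning of "n characters" matters, here is a rough
Python sketch (not anything taken from an actual RDBMS). It counts the same
string three ways and shows one possible validity check; the helper name
is_valid_unicode_text is my own invention for the example.

```python
# U+10302 is a supplementary-plane character (outside the BMP).
text = "a\U00010302"

code_points = len(text)                           # Python str counts code points
utf16_units = len(text.encode("utf-16-le")) // 2  # number of 16-bit code units
utf8_bytes = len(text.encode("utf-8"))            # storage size in UTF-8


def is_valid_unicode_text(s):
    """Hypothetical check: True if s contains no unpaired surrogate."""
    try:
        s.encode("utf-8")  # strict UTF-8 encoding refuses lone surrogates
        return True
    except UnicodeEncodeError:
        return False


print(code_points, utf16_units, utf8_bytes)
print(is_valid_unicode_text(text), is_valid_unicode_text("\ud800"))
```

A field declared with a length of 2 would accept this string if lengths are
counted in codepoints, but not if they are counted in UTF-16 code units.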

The actual number of bytes taken in the database storage is independent of
the SQL definition of tables (whether you use a syntax like "VARCHAR(n)" or
"CHAR(n)"), but many RDBMS engines (Oracle, Sybase, Informix, Microsoft SQL
Server) have chosen to define a new type (like "[VAR]NLSCHAR(n)") for
storing sequences of atomic Unicode codepoints rather than sequences of
code units (as with [VAR]CHAR(n)). Some engines do not use a separate type,
but allow the database to be created with a global parameter specifying the
encoding used for CHAR, CHAR(n), VARCHAR and VARCHAR(n).

So for me, the proposal makes sense: the syntax will only be used for the
representation of text in quoted strings within SQL request statements. The
escaping mechanism is not new in SQL; one has been needed for even longer to
escape the surrounding syntactic quote characters themselves.

Whether you run a SQL query specifying:
SELECT ... FROM... WHERE field='a'
or
SELECT ... FROM... WHERE field='\x61'
or
SELECT ... FROM... WHERE field='\u0061'
or
SELECT ... FROM... WHERE field='\U00000061'

the queries should be run equivalently in all cases, as long as the
character is representable in the target database table storage. The
behavior should be exactly similar to what has been adopted in HTML.

Same thing if you use surrogates with '\u' pairs or a single '\U'.
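
As a small illustration of that equivalence, the following Python sketch
applies the standard UTF-16 arithmetic to a pair of '\u' surrogates and
compares the result with the single '\U' form. The function name
combine_surrogates is only an assumption for the example; it is not part of
any proposed patch.

```python
def combine_surrogates(high, low):
    """Map a UTF-16 surrogate pair to its supplementary codepoint."""
    if not (0xD800 <= high <= 0xDBFF and 0xDC00 <= low <= 0xDFFF):
        raise ValueError("not a valid high/low surrogate pair")
    return 0x10000 + ((high - 0xD800) << 10) + (low - 0xDC00)


# Python's own string literals already treat these spellings of "a" as equal.
same_a = "a" == "\x61" == "\u0061" == "\U00000061"

# '\uD800\uDF02' and '\U00010302' designate the same single character.
pair_as_char = chr(combine_surrogates(0xD800, 0xDF02))

print(same_a, pair_as_char == "\U00010302")
```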

For historic reasons, C, C++, and Java (this is also true for Python, PHP,
Javascript...) have been quite relaxed about the possibility of storing
arbitrary sequences of 16-bit code units, including those invalid in
Unicode. There are pros and cons to this approach, but nothing forbids a
database engine from using the relaxed model, as long as the adopted syntax
still allows additional validity constraints to be specified (that's why I
think that it's simpler to use a separate datatype for Unicode codepoints,
independently of the internal UTF used for them, which may be UTF-8, UTF-16,
or a compressed UTF like BOCU or CESU). But in all cases, the syntax used in
SQL queries is completely independent of the actual storage representation.

I see no rationale for allowing one syntax and treating the others as
errors: just consider that the 32-bit '\U' syntax will always map to the
same '\u' sequences. This holds as long as the 32-bit syntax is NOT used to
indicate a surrogate codepoint: you only need to reject '\U0000D800' to
'\U0000DFFF' and '\U00110000' to '\UFFFFFFFF' as invalid SQL syntax,
because they would break the equivalences. You also need to check in the SQL
query lexer that '\uD800' to '\uDBFF' are correctly paired with a matching
'\uDC00' to '\uDFFF'. There will never be any ambiguity; everything will be
predictable.
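
These lexer rules can be sketched in a few lines of Python. This is only a
minimal illustration under my own assumptions, not the proposed Postgres
patch: the name decode_unicode_escapes is hypothetical, only the '\u' and
'\U' escapes are recognized, and other escapes (including an escaped
backslash) are ignored for brevity.

```python
import re

# One escape: \uXXXX (4 hex digits) or \UXXXXXXXX (8 hex digits).
_ESCAPE = re.compile(r"\\u([0-9A-Fa-f]{4})|\\U([0-9A-Fa-f]{8})")


def decode_unicode_escapes(literal):
    """Hypothetical decoder for the body of a quoted SQL string."""
    out = []
    pos = 0
    pending_high = None  # high surrogate still waiting for its low half
    while pos < len(literal):
        m = _ESCAPE.match(literal, pos)
        if m is None:
            # Ordinary character: it cannot complete a surrogate pair.
            if pending_high is not None:
                raise ValueError("unpaired high surrogate")
            out.append(literal[pos])
            pos += 1
            continue
        pos = m.end()
        if m.group(1) is not None:  # 16-bit \u form
            unit = int(m.group(1), 16)
            if pending_high is not None:
                if not 0xDC00 <= unit <= 0xDFFF:
                    raise ValueError("unpaired high surrogate")
                cp = 0x10000 + ((pending_high - 0xD800) << 10) + (unit - 0xDC00)
                out.append(chr(cp))
                pending_high = None
            elif 0xD800 <= unit <= 0xDBFF:
                pending_high = unit
            elif 0xDC00 <= unit <= 0xDFFF:
                raise ValueError("unpaired low surrogate")
            else:
                out.append(chr(unit))
        else:  # 32-bit \U form
            if pending_high is not None:
                raise ValueError("unpaired high surrogate")
            cp = int(m.group(2), 16)
            if 0xD800 <= cp <= 0xDFFF or cp > 0x10FFFF:
                raise ValueError("invalid \\U escape")
            out.append(chr(cp))
    if pending_high is not None:
        raise ValueError("unpaired high surrogate")
    return "".join(out)


print(decode_unicode_escapes(r"\uD800\uDF02") == decode_unicode_escapes(r"\U00010302"))
```

With these checks, both spellings of a supplementary character decode to the
same one-codepoint string, and every input that would break the equivalence
is rejected at lexing time.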

Handling the legacy '\xNN' syntax is generally more complex because it
usually depends on the encoding used by the SQL client (set in its session
parameters at logon time or in subsequently set locale parameters) when
communicating with the SQL server. But the '\u' and '\U' syntaxes must be
made independent of these encoding and client locale preferences.
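
A quick Python illustration of that dependency, assuming '\xE9' is sent as
the raw byte 0xE9: the same byte decodes to a different character under each
client encoding, whereas a '\u' escape names its codepoint directly. The
three encodings are just examples I picked.

```python
raw = b"\xe9"  # the single byte assumed to be sent for '\xE9'

as_latin1 = raw.decode("latin-1")  # U+00E9 under ISO-8859-1
as_cp437 = raw.decode("cp437")     # a different character under DOS code page 437
as_koi8r = raw.decode("koi8_r")    # yet another one under KOI8-R

# Print the codepoint each client encoding assigns to the same byte.
for name, ch in [("latin-1", as_latin1), ("cp437", as_cp437), ("koi8_r", as_koi8r)]:
    print(name, "U+%04X" % ord(ch))
```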

One difficulty will be to specify the behavior of the RDBMS engine if the
client connects to it using a client-side encoding that does not support
Unicode: should it be allowed to connect to the database if this is a global
setting, or to perform queries returning data columns containing Unicode
characters if the Unicode capability is specified per table or per column?
If it is allowed to perform the query, how should the RDBMS perform the
conversion (using a single substitute, or using approximation to other
"near" characters)? Or should it return a SQL query execution error
(invalidating the current transaction)? What will happen if the change of
encoding changes the number of codepoints after the conversion? Should the
conversion be performed by the server, or by the client within its local
client libraries, or within the supporting libraries of the programming
language or of the client OS?
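
The three conversion behaviors can be compared with a small Python sketch.
It is only an illustration: the function to_client and its policy names are
hypothetical, and the "approximate" branch uses NFKD decomposition with the
combining marks dropped, which is just one possible way of finding "near"
characters.

```python
import unicodedata


def to_client(text, encoding, policy):
    """Hypothetical conversion of stored Unicode text to a client encoding."""
    if policy == "error":
        # Raises UnicodeEncodeError on the first unrepresentable character.
        return text.encode(encoding, errors="strict")
    if policy == "substitute":
        # One '?' for each unrepresentable character.
        return text.encode(encoding, errors="replace")
    if policy == "approximate":
        # Decompose, drop combining marks, then substitute whatever remains.
        decomposed = unicodedata.normalize("NFKD", text)
        stripped = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
        return stripped.encode(encoding, errors="replace")
    raise ValueError("unknown policy: " + policy)


stored = "caf\u00e9 \U00010302"
print(to_client(stored, "ascii", "substitute"))
print(to_client(stored, "ascii", "approximate"))
```

Note that the "approximate" branch passes through an intermediate form with
more codepoints than the stored text, which is exactly the kind of length
change the question above is about.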

All these options are possible in Oracle, Sybase, MSSQL and Informix (most
probably in IBM DB2 too, though I'm not an expert in it). They all have
their advantages (compatibility and connectivity for legacy clients) and
drawbacks (in terms of data constraints and coherence). Generally, if you
allow a database to store Unicode codepoints, you should make sure that it
will store only valid Unicode text, and prepare your clients to handle
Unicode text as well, in an encoding fully compatible with Unicode (with
full roundtrip compatibility).

I don't see why this should be a new difficulty in Postgres, when it was
solved long ago in Oracle, Sybase, MS SQL and DB2, and more recently in
MySQL too...

> -----Original Message-----
Asmus Freytag wrote:
> On 4/16/2009 12:04 PM, Sam Mason wrote:
> > Hi All,
> >
> > I've got myself in a discussion about the correct handling of
> > surrogate pairs. The background is as follows; the Postgres database
> > server[1] currently assumes that the SQL it's receiving is in some
> > user specified encoding, and it's been proposed that it would be nicer
> > to be able to enter Unicode characters directly in the form of escape
> > codes in a similar form to Python, i.e. support would be added for:
> >
> > '\uxxxx'
> > and
> > '\Uxxxxxxxx'
> >
> > The currently proposed patch[2] specifically handles surrogate pairs
> > in the input. For example '\uD800\uDF02' and '\U00010302' would be
> > considered to be valid and identical strings containing exactly one
> > character. I was wondering if this should indeed be considered valid
> > or if an error should be returned instead.
> >
> >
> As long as there are pairs of the surrogate code points
> provided as escape sequences, there's an unambiguous relation
> between each pair and a code point in the supplementary
> planes. So far, so good.
>
> The upside is that the dual escape sequences facilitate
> conversion to/from UTF-16. Each code unit in UTF-16 can be
> processed separately.
>
> The downside is that you now have two equivalent escape
> mechanisms, and you can no longer take a string with escape
> sequences and binarily compare it without bringing it into a
> canonical form.
>
> However, if one is allowed to represent the character "a" both as 'a'
> and as '\u0061' (which I assume is possible) then there's
> already a certain ambiguity built into the escape sequence mechanism.
>
> What should definitely result in an error is to write '\U0000D800'
> because the 8-byte form is to be understood as UTF-32, and in
> that context there would be an issue.
>
> So, in short, if the definition of the escapes is as follows
>
> '\uxxxxx' - escape sequence for a UTF-16 code point
>
> '\Uxxxxxxxx' - escape sequence for a UTF-32 code point
>
> then everything is fine and predictable. If the definition of
> the shorter sequence, is instead, "a code point on the BMP"
> then it's not clear how to handle surrogate pairs.


This archive was generated by hypermail 2.1.5 : Thu Apr 16 2009 - 17:03:55 CDT