From: Philippe Verdy (firstname.lastname@example.org)
Date: Fri Apr 17 2009 - 04:02:16 CDT
Doug Ewell wrote:
> I have to agree with Asmus on this. Even if the \Uxxxxxxxx
> notation was originally created to get around the
> four-hex-digit limit of \uxxxx, it does imply a 32-bit value.
> Writing \U0000D835\U0000DC1A would strongly imply that two
> characters are being represented, not one. With this
> extended notation, there should be no reason to fall back to
> UTF-16 code units.
I also agree, except that I would have used the term "two assigned code
points" instead of "two characters". Here the surrogates are not characters.
The real question is whether they should be accepted. My opinion is that
"\U0000D835\U0000DC1A" should be rejected, simply because each of the two
32-bit values is a surrogate code point, outside the valid ranges for
characters.
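To make the rule concrete, here is a minimal sketch in Python of how a parser
might validate one \Uxxxxxxxx escape. The function name decode_big_u is
hypothetical and not taken from any SQL engine; it only assumes that valid
characters are the code points U+0000..U+10FFFF minus the surrogate range
U+D800..U+DFFF.

```python
import re

# Exactly eight hexadecimal digits, as the \Uxxxxxxxx notation requires.
_HEX8 = re.compile(r"[0-9A-Fa-f]{8}")


def decode_big_u(hex_digits: str) -> str:
    """Decode the eight hex digits of one \\Uxxxxxxxx escape.

    Hypothetical helper for illustration only.
    """
    if not _HEX8.fullmatch(hex_digits):
        raise ValueError("\\U needs exactly eight hex digits")
    cp = int(hex_digits, 16)
    if cp > 0x10FFFF:
        raise ValueError(f"U+{cp:X} is beyond the last code point U+10FFFF")
    if 0xD800 <= cp <= 0xDFFF:
        # A 32-bit escape never needs surrogates, so reject them outright.
        raise ValueError(f"U+{cp:04X} is a surrogate, not a character")
    return chr(cp)
```

With this sketch, decode_big_u("0001D41A") yields the single character
U+1D41A, while decode_big_u("0000D835") and decode_big_u("0000DC1A") both
raise ValueError.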
For the same reason, the syntax "\uDC1A" should be rejected if it does not
follow a matching first (high) surrogate.
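The pairing rule for \uxxxx could be sketched as follows in Python. This is
only an illustration under my own assumptions: the function name
combine_u_escapes is hypothetical, and its input is the list of 16-bit values
taken from consecutive \uxxxx escapes in one string constant.

```python
def combine_u_escapes(units):
    """Turn the 16-bit values of consecutive \\uxxxx escapes into a string.

    Hypothetical helper: rejects any surrogate that is not part of a
    well-formed high/low pair.
    """
    out = []
    i = 0
    while i < len(units):
        u = units[i]
        if not 0 <= u <= 0xFFFF:
            raise ValueError(f"{u:#x} does not fit in four hex digits")
        if 0xD800 <= u <= 0xDBFF:
            # High surrogate: the next unit must be a low surrogate.
            if i + 1 >= len(units) or not 0xDC00 <= units[i + 1] <= 0xDFFF:
                raise ValueError(f"\\u{u:04X} lacks a matching low surrogate")
            low = units[i + 1]
            out.append(chr(0x10000 + ((u - 0xD800) << 10) + (low - 0xDC00)))
            i += 2
        elif 0xDC00 <= u <= 0xDFFF:
            # Low surrogate with no high surrogate in front of it.
            raise ValueError(f"\\u{u:04X} does not follow a high surrogate")
        else:
            out.append(chr(u))
            i += 1
    return "".join(out)
```

Here combine_u_escapes([0xD835, 0xDC1A]) gives the one character U+1D41A,
whereas [0xDC1A] alone, [0xD835] alone, or the reversed pair
[0xDC1A, 0xD835] are all rejected.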
In other words, the syntax should be used only to create VALID sequences of
UTF-16 code units or VALID sequences of UTF-32 code units, so that there is
no ambiguity in their conversion into sequences of code points, and back to
code units of any kind.
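The "no ambiguity" property can be checked with a small round trip. The
sketch below uses Python's built-in utf-16-be and utf-32-be codecs; the four
helper names are hypothetical and chosen only for this illustration.

```python
import struct


def utf16_units(s):
    """Return the UTF-16 code units of s as a list of integers."""
    data = s.encode("utf-16-be")
    return list(struct.unpack(f">{len(data) // 2}H", data))


def utf32_units(s):
    """Return the UTF-32 code units of s as a list of integers."""
    data = s.encode("utf-32-be")
    return list(struct.unpack(f">{len(data) // 4}I", data))


def from_utf16_units(units):
    """Rebuild a string from UTF-16 code units; ill-formed input raises."""
    return struct.pack(f">{len(units)}H", *units).decode("utf-16-be")


def from_utf32_units(units):
    """Rebuild a string from UTF-32 code units; ill-formed input raises."""
    return struct.pack(f">{len(units)}I", *units).decode("utf-32-be")
```

For the character U+1D41A, utf16_units gives [0xD835, 0xDC1A] and utf32_units
gives [0x1D41A]; both convert back to the same one-character string. A lone
0xDC1A passed to from_utf16_units raises UnicodeDecodeError, which is the
behaviour I am arguing the escape syntax should have as well.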
Also I see absolutely no "waste of space" when using them: the database will
not actually use these syntaxes within the storage tables. The syntaxes are
only for constant strings appearing within SQL queries. They don't affect
the data stored in input variables bound to SQL query placeholders, or in
columns of SELECT result sets, unless they are converted and displayed this
way by the SQL client itself.
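As a rough illustration of the point about bound values, here is a sketch
using Python's sqlite3 module. SQLite is only a stand-in I picked for
convenience, not the database discussed in this thread, and the function
name roundtrip_bound_value is hypothetical. It shows that a value bound to a
placeholder is stored and returned unchanged, with no escape processing.

```python
import sqlite3


def roundtrip_bound_value(value):
    """Store value through a bound placeholder and read it back."""
    conn = sqlite3.connect(":memory:")
    try:
        conn.execute("CREATE TABLE t (s TEXT)")
        # The value travels as data, so no escape syntax is interpreted.
        conn.execute("INSERT INTO t (s) VALUES (?)", (value,))
        (stored,) = conn.execute("SELECT s FROM t").fetchone()
        return stored
    finally:
        conn.close()
```

Binding the one-character string U+1D41A returns that same character, and
binding the ten-character text consisting of a backslash followed by
U0001D41A returns those ten characters untouched.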