RE: Handling of Surrogates

From: Philippe Verdy (verdy_p@wanadoo.fr)
Date: Fri Apr 17 2009 - 04:37:19 CDT

Next message: António MARTINS-Tuválkin: "Dal and sad with 3 dots below"

Previous message: Johannes Bergerhausen: "more dingbats in plain text"
In reply to: Bjoern Hoehrmann: "Re: Handling of Surrogates"
Next in thread: Sam Mason: "Re: Handling of Surrogates"

Bjoern Hoehrmann wrote:
> >
> > '\uxxxx'
> >and
> > '\Uxxxxxxxx'
>
> I think you would be better off doing it similar to Perl,
> which uses explicit delimiters for the value. This has a
> number of benefits: you can parse it case-insensitively,
> there is no confusion if you, for example, want U+AFFE
> followed by the literal "AFFE", there is no confusion as to
> what the required length is (some formats allow only six
> digits), it is extendable (with Perl you can also use
> character names, alias names, etc.), and the answer to your
> question is more obvious. Perl generates a warning if you
> specify surrogate code points.

You have not understood the issue. The proposed \u and \U syntaxes make a
clear distinction: each requires a fixed number of hex digits after it (4
and 8 respectively), regardless of the code point value. Both accept hex
digits in lowercase or uppercase without any problem or ambiguity. So there
is no confusion if they are followed by the literal "AFFE".
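To illustrate the fixed-width point, here is a minimal sketch in Python. The
helper name split_escapes is hypothetical and not part of any proposal in this
thread; it only assumes that \u takes exactly 4 hex digits and \U exactly 8,
and it does not handle escaped backslashes.

```python
import re

# Assumption: \u is followed by exactly 4 hex digits, \U by exactly 8.
# The escape letter is case-sensitive; the hex digits are not.
_ESCAPE = re.compile(r"\\u([0-9A-Fa-f]{4})|\\U([0-9A-Fa-f]{8})")

def split_escapes(text):
    """Split text into escape values (int) and literal runs (str)."""
    out = []
    pos = 0
    for m in _ESCAPE.finditer(text):
        if m.start() > pos:
            out.append(text[pos:m.start()])   # literal text before the escape
        out.append(int(m.group(1) or m.group(2), 16))
        pos = m.end()
    if pos < len(text):
        out.append(text[pos:])                # trailing literal text
    return out

# The escape consumes exactly four digits; the rest stays literal.
print(split_escapes(r"\uAFFEAFFE"))   # [45054, 'AFFE']
```

Because the digit count is fixed, the value 0xAFFE (45054) and the trailing
literal "AFFE" are separated without any delimiter.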

The issues discussed here lie elsewhere: in the use of surrogates, in the
validity of the sequences of code units that these syntaxes represent, and
in the number of valid code points they stand for. For me, surrogates are
not valid code points for storage purposes, even though these code points
are reserved in the standard (and will never be assigned to characters).

But it's true that the \u and \U syntaxes are rather confusing. The
delimited approach, as in HTML '&#xNNNNN;' or in '\x{NNNN}', has its
merits: you don't have to represent code units, only code points, and
there's no need to enforce a number of hex digits.

That holds as long as the hex values fall into the correct ranges for valid
code points: U+0000..U+D7FF or U+E000..U+10FFFF (i.e. exactly the same ranges
accepted by strict UTF-8, or by other compliant UTFs, including compressed
ones like BOCU and SCSU).
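The delimited form and the range check can be sketched together in Python.
The names is_scalar_value and decode_delimited are hypothetical, and the
'\x{...}' pattern only follows the Perl-like form mentioned above; this is an
illustration under those assumptions, not a specification.

```python
import re

# Assumption: any number of hex digits between the braces.
_DELIMITED = re.compile(r"\\x\{([0-9A-Fa-f]+)\}")

def is_scalar_value(cp):
    """True for U+0000..U+D7FF or U+E000..U+10FFFF (no surrogates)."""
    return 0x0000 <= cp <= 0xD7FF or 0xE000 <= cp <= 0x10FFFF

def decode_delimited(text):
    """Replace each \\x{...} escape by its character, rejecting bad values."""
    def repl(m):
        cp = int(m.group(1), 16)
        if not is_scalar_value(cp):
            raise ValueError("U+%04X is not valid for storage" % cp)
        return chr(cp)
    return _DELIMITED.sub(repl, text)

print(is_scalar_value(0xD7FF), is_scalar_value(0xD800))   # True False
print(len(decode_delimited(r"\x{1D41A}")))                # 1
```

The closing brace ends the escape, so no digit count has to be enforced; only
the range of the resulting value is checked.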

So the proposed syntaxes should enforce exactly the same validity constraints
on the represented code points, independently of the code units indirectly
referenced (only locally within strings) by the syntax.

For this reason, suppose I have a table containing a CHAR(1) column with
Unicode capability and I write:
INSERT INTO mytable VALUES('\uD800', ...)
This should fail immediately because there's no way to map it to a single
character.

Conversely,
INSERT INTO mytable VALUES('\uD835\uDC1A', ...)
or
INSERT INTO mytable VALUES('\U0001D41A', ...)
should be accepted as they effectively represent a single character in each
case.
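The accept/reject behaviour described above can be sketched in Python as
follows. The names decode_literal and fits_char1 are hypothetical, and the
sketch assumes that a \u high surrogate is only valid when immediately
followed by a \u low surrogate, and that \U may never name a surrogate. The
pair \uD835\uDC1A combines to the single code point U+1D41A.

```python
import re

_ESCAPE = re.compile(r"\\u([0-9A-Fa-f]{4})|\\U([0-9A-Fa-f]{8})")

def decode_literal(text):
    """Decode the escapes in text; raise ValueError on invalid code points."""
    out, pos, high = [], 0, None   # high: pending high surrogate, if any
    for m in _ESCAPE.finditer(text):
        if m.start() > pos:
            if high is not None:
                raise ValueError("lone high surrogate U+%04X" % high)
            out.append(text[pos:m.start()])
        pos = m.end()
        value = int(m.group(1) or m.group(2), 16)
        short = m.group(1) is not None          # True for the 4-digit form
        if high is not None:
            if short and 0xDC00 <= value <= 0xDFFF:
                # Combine the surrogate pair into one supplementary code point.
                cp = 0x10000 + ((high - 0xD800) << 10) + (value - 0xDC00)
                out.append(chr(cp))
                high = None
                continue
            raise ValueError("lone high surrogate U+%04X" % high)
        if short and 0xD800 <= value <= 0xDBFF:
            high = value                        # wait for the low surrogate
            continue
        if 0xD800 <= value <= 0xDFFF or value > 0x10FFFF:
            raise ValueError("invalid code point U+%04X" % value)
        out.append(chr(value))
    if high is not None:
        raise ValueError("lone high surrogate U+%04X" % high)
    out.append(text[pos:])
    return "".join(out)

def fits_char1(text):
    """True if the decoded literal is exactly one code point."""
    return len(decode_literal(text)) == 1

print(fits_char1(r"\uD835\uDC1A"))   # True
print(fits_char1(r"\U0001D41A"))     # True
```

With this sketch, decode_literal(r"\uD800") raises ValueError, which matches
the immediate failure expected for the first INSERT.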

This archive was generated by hypermail 2.1.5 : Fri Apr 17 2009 - 04:39:37 CDT