RE: Handling of Surrogates

From: Peter Constable (petercon@microsoft.com)
Date: Fri Apr 17 2009 - 09:34:49 CDT

Next message: James Cloos: "Re: Handling of Surrogates"

Previous message: Peter Constable: "RE: more dingbats in plain text"
In reply to: Mark Davis: "Re: Handling of Surrogates"
Next in thread: Philippe Verdy: "RE: Handling of Surrogates"

For U+1D41A, wouldn’t it matter how #3 gets interpreted? If it gets mapped into the UTF-32BE byte sequence [0x00,0x00,0xd8,0x35,0x00,0x00,0xdc,0x1a], it seems to me that would not be good.

If #3 gets mapped into a UTF-32BE byte sequence of [0x00,0x01,0xd4,0x1a], or if it gets mapped into a UTF-16BE byte sequence of [0xd8,0x35,0xdc,0x1a], or the LE or UTF-8 equivalents, then that would be OK.
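The byte sequences above can be checked with a short Python sketch. It is only an illustration: the variable names are made up here, and the "surrogatepass" error handler is used solely to reproduce the problematic output, which Python's UTF-32 codec refuses to produce by default.

```python
# U+1D41A MATHEMATICAL BOLD SMALL A as one code point.
ch = "\U0001D41A"

good_utf32be = ch.encode("utf-32-be")   # 00 01 d4 1a
good_utf16be = ch.encode("utf-16-be")   # d8 35 dc 1a
good_utf8 = ch.encode("utf-8")          # f0 9d 90 9a

# The two surrogates kept as separate code points (what a naive reading
# of \U0000D835\U0000DC1A would give). Python does not merge them.
pair = "\ud835\udc1a"

# By default the UTF-32 codec rejects surrogates outright.
try:
    pair.encode("utf-32-be")
    rejected = False
except UnicodeEncodeError:
    rejected = True

# Forcing them through yields the "not good" sequence:
# 00 00 d8 35 00 00 dc 1a
bad_utf32be = pair.encode("utf-32-be", "surrogatepass")
```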

Peter

From: unicode-bounce@unicode.org [mailto:unicode-bounce@unicode.org] On Behalf Of Mark Davis
Sent: Thursday, April 16, 2009 2:56 PM
To: Asmus Freytag
Cc: Sam Mason; unicode@unicode.org
Subject: Re: Handling of Surrogates

I disagree somewhat, if I understand what you wrote. When the \u and \U conventions are used:

U+0061<http://unicode.org/cldr/utility/character.jsp?a=0061> ( a ) LATIN SMALL LETTER A could be represented as any of:

1. 'a'
2. \u0061
3. \U00000061
The use of #3 is a waste of space, but should not be illegal (except where \U is not available). E.g. http://unicode.org/cldr/utility/list-unicodeset.jsp?a=\U00000061

U+1D41A<http://unicode.org/cldr/utility/character.jsp?a=1D41A> ( 𝐚 ) MATHEMATICAL BOLD SMALL A could be represented as any of:

1. '𝐚'
2. \uD835\uDC1A
3. \U0000D835\U0000DC1A
4. \U0001D41A
Similarly, #3 is a waste of space, but should not be illegal. #2 and #3 are discouraged where \U is available or UTF-16 is not used, but #2 is necessary where \U is not available (e.g. Java). [Myself, I like \x{...} escaping better, since it is more uniform. Having a terminator allows variable length.]
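A lenient reading of the four forms can be sketched in Python as follows. This is an assumption-laden illustration, not the Postgres patch or any library's actual behavior: `decode_escapes` is a hypothetical helper, it does not handle escaped backslashes, and it merges surrogate pairs by a round trip through UTF-16.

```python
import re

# \UXXXXXXXX (8 hex digits) or \uXXXX (4 hex digits).
_ESCAPE = re.compile(r"\\U([0-9A-Fa-f]{8})|\\u([0-9A-Fa-f]{4})")

def decode_escapes(text):
    """Hypothetical helper: expand escapes, then join surrogate pairs."""
    def expand(match):
        # chr() raises ValueError for values above 0x10FFFF.
        return chr(int(match.group(1) or match.group(2), 16))

    expanded = _ESCAPE.sub(expand, text)
    # Encoding with "surrogatepass" lets surrogates through as UTF-16
    # code units; the strict decode merges pairs and raises
    # UnicodeDecodeError on a lone surrogate.
    return expanded.encode("utf-16-le", "surrogatepass").decode("utf-16-le")

forms = [
    "\U0001D41A",               # 1. the literal character
    r"\uD835\uDC1A",            # 2. two \u escapes
    r"\U0000D835\U0000DC1A",    # 3. two \U escapes
    r"\U0001D41A",              # 4. one \U escape
]
decoded = [decode_escapes(f) for f in forms]
```

Under this reading all four entries of `decoded` are the same one-character string, which is the behavior argued for above; a stricter parser could reject form #3 instead.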

Mark

On Thu, Apr 16, 2009 at 13:04, Asmus Freytag <asmusf@ix.netcom.com> wrote:
On 4/16/2009 12:04 PM, Sam Mason wrote:
Hi All,

I've got myself in a discussion about the correct handling of surrogate
pairs. The background is as follows: the Postgres database server[1]
currently assumes that the SQL it's receiving is in some user specified
encoding, and it's been proposed that it would be nicer to be able to
enter Unicode characters directly in the form of escape codes in a
similar form to Python, i.e. support would be added for:

'\uxxxx'
and
'\Uxxxxxxxx'

The currently proposed patch[2] specifically handles surrogate pairs
in the input. For example '\uD800\uDF02' and '\U00010302' would be
considered to be valid and identical strings containing exactly one
character. I was wondering if this should indeed be considered valid or
if an error should be returned instead.

As long as there are pairs of the surrogate code points provided as escape sequences, there's an unambiguous relation between each pair and a code point in the supplementary planes. So far, so good.
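The relation between a surrogate pair and its supplementary code point is the standard UTF-16 arithmetic. A minimal Python sketch follows; the function names are chosen here for illustration only.

```python
def combine_surrogates(high, low):
    """Map a high/low surrogate pair to its supplementary code point."""
    if not (0xD800 <= high <= 0xDBFF and 0xDC00 <= low <= 0xDFFF):
        raise ValueError("not a high/low surrogate pair")
    return 0x10000 + ((high - 0xD800) << 10) + (low - 0xDC00)

def split_code_point(cp):
    """Map a supplementary code point to its (high, low) surrogate pair."""
    if not 0x10000 <= cp <= 0x10FFFF:
        raise ValueError("not a supplementary code point")
    offset = cp - 0x10000
    return 0xD800 + (offset >> 10), 0xDC00 + (offset & 0x3FF)

# The pair from the proposed patch: \uD800\uDF02 <-> U+10302.
cp = combine_surrogates(0xD800, 0xDF02)
pair = split_code_point(0x10302)
```

Because the mapping is a bijection between valid pairs and the range U+10000..U+10FFFF, '\uD800\uDF02' and '\U00010302' name the same character.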

The upside is that the dual escape sequences facilitate conversion to/from UTF-16. Each code unit in UTF-16 can be processed separately.

The downside is that you now have two equivalent escape mechanisms, and you can no longer take a string with escape sequences and binarily compare it without bringing it into a canonical form.

However, if one is allowed to represent the character "a" both as 'a' and as '\u0061' (which I assume is possible) then there's already a certain ambiguity built into the escape sequence mechanism.

What should definitely result in an error is to write '\U0000D800', because the 8-digit form is to be understood as UTF-32, and in that context a surrogate value is not a valid code point.

So, in short, if the definition of the escapes is as follows

'\uxxxx' - escape sequence for a UTF-16 code unit

'\Uxxxxxxxx' - escape sequence for a UTF-32 code point

then everything is fine and predictable. If the definition of the shorter sequence is instead "a code point on the BMP", then it's not clear how to handle surrogate pairs.
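The two definitions above can be turned into a strict parser. The following Python sketch is one possible reading, not the behavior of Postgres or of any existing library: `decode_strict` is a hypothetical name, \u values are treated as UTF-16 code units that must pair up correctly, \U values are treated as UTF-32 code points that may not be surrogates, and escaped backslashes are not handled.

```python
import re

_TOKEN = re.compile(r"\\U([0-9A-Fa-f]{8})|\\u([0-9A-Fa-f]{4})")

def decode_strict(text):
    out = []
    high = None   # pending high surrogate from a preceding \u escape
    pos = 0
    for m in _TOKEN.finditer(text):
        if m.start() > pos:
            # Literal text between escapes cannot complete a pair.
            if high is not None:
                raise ValueError("unpaired high surrogate")
            out.append(text[pos:m.start()])
        pos = m.end()
        if m.group(1) is not None:
            # \UXXXXXXXX: a UTF-32 code point, surrogates not allowed.
            if high is not None:
                raise ValueError("unpaired high surrogate")
            cp = int(m.group(1), 16)
            if cp > 0x10FFFF or 0xD800 <= cp <= 0xDFFF:
                raise ValueError("invalid UTF-32 code point")
            out.append(chr(cp))
        else:
            # \uXXXX: a UTF-16 code unit.
            unit = int(m.group(2), 16)
            if high is not None:
                if not 0xDC00 <= unit <= 0xDFFF:
                    raise ValueError("unpaired high surrogate")
                out.append(chr(0x10000 + ((high - 0xD800) << 10)
                               + (unit - 0xDC00)))
                high = None
            elif 0xD800 <= unit <= 0xDBFF:
                high = unit
            elif 0xDC00 <= unit <= 0xDFFF:
                raise ValueError("unpaired low surrogate")
            else:
                out.append(chr(unit))
    if high is not None:
        raise ValueError("unpaired high surrogate")
    out.append(text[pos:])
    return "".join(out)
```

With these rules `decode_strict(r"\uD800\uDF02")` and `decode_strict(r"\U00010302")` return the same one-character string, while `decode_strict(r"\U0000D800")` raises `ValueError`.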

A./


This archive was generated by hypermail 2.1.5 : Fri Apr 17 2009 - 09:36:51 CDT