Re: Handling of Surrogates

From: Mark Davis ([email protected])
Date: Thu Apr 16 2009 - 18:42:42 CDT

Next message: Bjoern Hoehrmann: "Re: Handling of Surrogates"

Previous message: Asmus Freytag: "Re: Handling of Surrogates"
In reply to: Asmus Freytag: "Re: Handling of Surrogates"
Next in thread: Doug Ewell: "Re: Handling of Surrogates"
Reply: Doug Ewell: "Re: Handling of Surrogates"

> If you use the definition
> '\uxxxx' - escape sequence for a UTF-16 code unit
> '\Uxxxxxxxx' - escape sequence for a UTF-32 code unit

You'll have to stop there, because that isn't the definition typically (or
certainly not universally) used. In practice, these conventions arose as an
adaptation as we went from UCS-2 to UTF-16:

\u means the code point and the equivalent code unit (for these cases they
have identical numeric values).

\U means the code point and the equivalent code unit if <= FFFF (for these
cases they have identical numeric values),
and otherwise a code point and the equivalent two paired code units.
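
Read this way, every escaped form reduces to a sequence of UTF-16 code
units, which is then interpreted as UTF-16. A minimal Python sketch of such a
decoder follows. The names decode_escapes and to_code_units are made up for
illustration, and rejecting unpaired surrogates at the end is an assumption,
not something the convention itself requires.

```python
import re

# Hypothetical helpers illustrating one reading of the convention above.
_ESCAPE = re.compile(r"\\u([0-9A-Fa-f]{4})|\\U([0-9A-Fa-f]{8})")

def to_code_units(cp):
    """Return the UTF-16 code unit(s) equivalent to a code point value."""
    if cp <= 0xFFFF:
        return [cp]  # identical numeric value, one code unit
    if cp > 0x10FFFF:
        raise ValueError("beyond U+10FFFF: %X" % cp)
    cp -= 0x10000
    return [0xD800 + (cp >> 10), 0xDC00 + (cp & 0x3FF)]

def decode_escapes(text):
    """Expand \\u and \\U escapes and interpret the result as UTF-16."""
    units = []
    pos = 0
    for m in _ESCAPE.finditer(text):
        # Literal characters between escapes contribute their own code units.
        for ch in text[pos:m.start()]:
            units.extend(to_code_units(ord(ch)))
        units.extend(to_code_units(int(m.group(1) or m.group(2), 16)))
        pos = m.end()
    for ch in text[pos:]:
        units.extend(to_code_units(ord(ch)))
    raw = b"".join(u.to_bytes(2, "big") for u in units)
    # The strict UTF-16 decoder raises UnicodeDecodeError on unpaired surrogates.
    return raw.decode("utf-16-be")
```

With this sketch, r"\uD835\uDC1A", r"\U0000D835\U0000DC1A" and r"\U0001D41A"
all decode to the same one-character string.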

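While it would be possible to restrict the recognized escapes, best practice
for interoperability is to accept all 7 forms (A.1-A.3 being the three forms
for U+0061 and B.1-B.4 the four forms for U+1D41A in my earlier message,
quoted below). When generating, as I said, it would be cleaner to not do A.3
or B.3, and to do B.2 if and only if B.4 is unavailable.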
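
The generating side of that advice can be sketched in Python as below. The
function name escape and the have_big_u flag are hypothetical, and the choice
to leave printable ASCII as literals while escaping everything else is an
assumption made only to keep the example small.

```python
def escape(text, have_big_u=True):
    """Escape a string, never padding with \\U where \\u or a literal will do."""
    out = []
    for ch in text:
        cp = ord(ch)
        if 0x20 <= cp < 0x7F and ch != "\\":
            out.append(ch)                 # literal form
        elif cp <= 0xFFFF:
            out.append("\\u%04X" % cp)     # four-digit form, never \U0000xxxx
        elif have_big_u:
            out.append("\\U%08X" % cp)     # single eight-digit escape
        else:
            # \U is unavailable (as in Java), so fall back to a surrogate pair.
            v = cp - 0x10000
            out.append("\\u%04X\\u%04X" % (0xD800 + (v >> 10),
                                           0xDC00 + (v & 0x3FF)))
    return "".join(out)
```

For U+1D41A this produces \U0001D41A by default and \uD835\uDC1A when
have_big_u is False.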

> Further, you create the problem that illegal UTF-32 can get converted to
> legal UTF-32.

These conventions are designed for and typically used for UTF-16 literal
text. And if they are used with other UTFs, they should be interpreted as
representing what they would be in UTF-16. That is, each of the 7 formats I
listed would have the corresponding meaning.

> You now have introduced into your distributed application a way to convert
> illegal UTF-32 sequences silently to legal UTF-32 sequences. From a security
> point of view, that would give me pause.

First off, essentially nobody uses UTF-32 for interchange, so your example
would have been better as UTF-8 (you can do the same example). Secondly,
yes, this escaping format is based on UTF-16, and thus has some history
behind it. But it doesn't present any significant problem. You can get a
well-formed result from an ill-formed source if you either:

   - convert ill-formed UTF-32 or UTF-8 to the escaped form without
   checking for ill-formed source, *OR*
   - convert ill-formed UTF-32 or UTF-8 to UTF-16 without checking
   for ill-formed source.
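
The second case can be reproduced in a few lines of Python. This is only an
illustration: utf32_values_to_utf16 and its check_surrogates flag are
hypothetical names, and the input is modelled as a list of 32-bit values
rather than raw bytes.

```python
import struct

def utf32_values_to_utf16(values, check_surrogates=True):
    """Convert 32-bit values to big-endian UTF-16 bytes."""
    units = []
    for v in values:
        if v > 0x10FFFF:
            raise ValueError("beyond U+10FFFF: %08X" % v)
        if check_surrogates and 0xD800 <= v <= 0xDFFF:
            raise ValueError("surrogate code point in UTF-32: %08X" % v)
        if v <= 0xFFFF:
            units.append(v)  # unchecked surrogates pass straight through here
        else:
            v -= 0x10000
            units += [0xD800 + (v >> 10), 0xDC00 + (v & 0x3FF)]
    return struct.pack(">%dH" % len(units), *units)

ill_formed = [0x0000D835, 0x0000DC1A]   # not valid UTF-32

# Without the check, the two values land next to each other as a valid pair.
unchecked = utf32_values_to_utf16(ill_formed, check_surrogates=False)
result = unchecked.decode("utf-16-be")
```

Here result is the single character U+1D41A, while the same call with
check_surrogates=True raises ValueError.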

Of course, if I also do a conversion from UTF-32 where I replace surrogate
code points by FFFD, I also get a valid result. The key problem for security
is where I can sneak harmful characters past a gatekeeper. Very few servers
use surrogate characters (or FFFD) as syntax characters ;-)
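
For comparison, the FFFD replacement mentioned above might look like this in
Python. It is a sketch only; replace_surrogates is a hypothetical name, and
the input is again modelled as a list of 32-bit values.

```python
def replace_surrogates(values):
    """Replace surrogate code points in a UTF-32 value list with U+FFFD."""
    return [0xFFFD if 0xD800 <= v <= 0xDFFF else v for v in values]

cleaned = replace_surrogates([0x61, 0xD835, 0xDC1A, 0x1D41A])
text = "".join(chr(v) for v in cleaned)
```

The output is well-formed, but unlike the pairing behaviour it yields two
replacement characters in place of the two surrogate values.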

And yes, I did forget B.3a and B.3b, which are also possible.

a \U0000D835\uDC1A
b \uD835\U0000DC1A

Ugly, but the meaning is well-defined.
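
The arithmetic behind that meaning is the standard UTF-16 pairing formula,
which only looks at the two code unit values and not at which escape letter
carried them. A small Python check, with combine as a hypothetical helper
name:

```python
def combine(high, low):
    """Map a UTF-16 surrogate pair to its supplementary code point."""
    if not (0xD800 <= high <= 0xDBFF and 0xDC00 <= low <= 0xDFFF):
        raise ValueError("not a high/low surrogate pair")
    return 0x10000 + ((high - 0xD800) << 10) + (low - 0xDC00)

code_point = combine(0xD835, 0xDC1A)
```

Here code_point is 0x1D41A, MATHEMATICAL BOLD SMALL A, for both a and b.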

Mark

On Thu, Apr 16, 2009 at 15:42, Asmus Freytag <[email protected]> wrote:

> On 4/16/2009 2:55 PM, Mark Davis wrote:
>
>> I disagree somewhat, if I understand what you wrote.
>>
> I think that you misunderstood what I wrote.
>
>> When the \u and \U conventions are used:
>>
>> |U+0061 <http://unicode.org/cldr/utility/character.jsp?a=0061>| ( a )
>> LATIN SMALL LETTER A could be represented as any of:
>>
>> 1. 'a'
>> 2. \u0061
>> 3. \U00000061
>>
>> The use of #3 is a waste of space, but should not be illegal (except where
>> \U is not available).
>>
> I agree completely so far.
>
>> |U+1D41A <http://unicode.org/cldr/utility/character.jsp?a=1D41A>| ( 𝐚 )
>> MATHEMATICAL BOLD SMALL A could be represented as any of:
>>
>> 1. '𝐚'
>> 2. \uD835\uDC1A
>> 3. \U0000D835\U0000DC1A
>> 4. \U0001D41A
>>
>> Similarly #3 is a waste of space, but should not be illegal. #2 and #3 are
>> discouraged where \U is available or UTF-16 is not used, but #2 is necessary
>> where \U is not available (eg Java). [Myself, I like \x{...} escaping
>> better, since it is more uniform. Having a terminator allows variable
>> length.]
>>
> OK. Here's where I think it matters how the escapes are defined.
>
> If you use the definition
>
> '\uxxxx' - escape sequence for a UTF-16 code unit
> '\Uxxxxxxxx' - escape sequence for a UTF-32 code unit
>
> then everything is well-defined. Examples 1, 2, and 4 in your second set
> are clearly legal, and example 3 is clearly not equivalent. Note that the
> lack of equivalence follows from the definition of UTF-32, just as the
> equivalence between examples 2 and 3 in the *first* set follows from the
> definition of UTF-32 and UTF-16.
>
> How would you rigorously define these two styles of escapes, so that
> example #3 (second set) becomes legal? You would have to do something
> complicated like
>
> '\uxxxx' - escape sequence for a UTF-16 code unit
> '\Uxxxxxxxx' - escape sequence for a UTF-32 code unit
> if xxxxxxxx >= 0x10000, but escape sequence for a
> UTF-16 code unit if xxxxxxxx < 0x10000.
>
> To me, that seems unnecessarily convoluted.
>
> Further, you create the problem that illegal UTF-32 can get converted to
> legal UTF-32.
>
> Here's how: Client 1 starts out with illegal UTF-32 containing the sequence
> <0000D835, 0000DC1A>. Assume this gets turned into the escapes
> "\U0000D835\U0000DC1A" and sent to the server. Client 2 receives this
> escaped sequence and interprets it as the single character sequence
> <0001D41A>. Had client 1 sent the UTF-32 string to client 2 directly, client
> 2 would have been able to reject it as illegal UTF-32.
>
> However, now we have client 3, which works in UTF-16 and has data of the
> form <D835, DC1A>. Under your scheme, client 3 has a choice. It can send any
> one of these four sequences of escape sequences containing surrogates
> "\uD835\U0000DC1A"
> "\U0000D835\uDC1A"
> "\U0000D835\U0000DC1A"
> or
> "\uD835\uDC1A"
>
> To the server, the third sequence of escapes matches what client 1 has
> produced starting with an illegal UTF-32 sequence.
>
> You now have introduced into your distributed application a way to convert
> illegal UTF-32 sequences silently to legal UTF-32 sequences. From a security
> point of view, that would give me pause.
>
> A./
>
>>
>> Mark
>>
>>
>>
>> On Thu, Apr 16, 2009 at 13:04, Asmus Freytag <[email protected]> wrote:
>>
>> On 4/16/2009 12:04 PM, Sam Mason wrote:
>>
>> Hi All,
>>
>> I've got myself in a discussion about the correct handling of surrogate
>> pairs. The background is as follows; the Postgres database server[1]
>> currently assumes that the SQL it's receiving is in some user specified
>> encoding, and it's been proposed that it would be nicer to be able to
>> enter Unicode characters directly in the form of escape codes in a
>> similar form to Python, i.e. support would be added for:
>>
>> '\uxxxx'
>> and
>> '\Uxxxxxxxx'
>>
>> The currently proposed patch[2] specifically handles surrogate pairs
>> in the input. For example '\uD800\uDF02' and '\U00010302' would be
>> considered to be valid and identical strings containing exactly one
>> character. I was wondering if this should indeed be considered valid or
>> if an error should be returned instead.
>>
>>
>> As long as there are pairs of the surrogate code points provided
>> as escape sequences, there's an unambiguous relation between each
>> pair and a code point in the supplementary planes. So far, so good.
>>
>> The upside is that the dual escape sequences facilitate conversion
>> to/from UTF-16. Each code unit in UTF-16 can be processed separately.
>>
>> The downside is that you now have two equivalent escape
>> mechanisms, and you can no longer take a string with escape
>> sequences and binarily compare it without bringing it into a
>> canonical form.
>>
>> However, if one is allowed to represent the character "a" both as
>> 'a' and as '\u0061' (which I assume is possible) then there's
>> already a certain ambiguity built into the escape sequence mechanism.
>>
>> What should definitely result in an error is to write '\U0000D800'
>> because the 8-digit form is to be understood as UTF-32, and in that
>> context there would be an issue.
>>
>> So, in short, if the definition of the escapes is as follows
>>
>> '\uxxxx' - escape sequence for a UTF-16 code unit
>>
>> '\Uxxxxxxxx' - escape sequence for a UTF-32 code unit
>>
>> then everything is fine and predictable. If the definition of the
>> shorter sequence is instead "a code point on the BMP", then it's
>> not clear how to handle surrogate pairs.
>>
>> A./
>>
>>
>>
>


This archive was generated by hypermail 2.1.5 : Thu Apr 16 2009 - 18:44:52 CDT