Re: Unpaired surrogates

From: Asmus Freytag (t) <>
Date: Tue, 20 Oct 2015 04:29:17 -0700
When it comes to methods operating on buffers, there's always a tension between viewing the buffer as text elements and viewing it as data elements. For some purposes, from error detection to data cleanup, you need to be able to treat the buffer as data elements. For many other operations, a focus on text elements is enough.

If you desire to have a regex that you can use to validate a raw buffer, then that regex must do something sensible with partial code points. If you don't have multiple regex engines, then limiting your single one to valid input prevents you from using it everywhere.
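
To make the "buffer as data elements" case concrete, here is a minimal sketch in Python. It is only an illustration, not something UTS #18 prescribes. The helper name find_lone_surrogates is made up for this example, and it assumes the raw buffer holds little-endian 16-bit code units. It relies on Python's "surrogatepass" error handler, which lets unpaired surrogates survive decoding as surrogate code points, so a single regex engine can then report them.

```python
import re

# Any surrogate code point that survives decoding is unpaired:
# well-formed pairs have already been combined by the decoder.
LONE_SURROGATE = re.compile('[\ud800-\udfff]')

def find_lone_surrogates(raw):
    """Return the code point indexes of unpaired surrogates.

    Hypothetical helper; assumes `raw` is a UTF-16LE code unit buffer.
    """
    text = raw.decode('utf-16-le', errors='surrogatepass')
    return [m.start() for m in LONE_SURROGATE.finditer(text)]

well_formed = '\U00010000'.encode('utf-16-le')   # code units D800 DC00
ill_formed = b'a\x00' + b'\x00\xd8' + b'b\x00'   # 'a', lone D800, 'b'

print(find_lone_surrogates(well_formed))  # no unpaired surrogates
print(find_lone_surrogates(ill_formed))   # index of the lone D800
```

Note that the reported indexes count code points in the decoded string, not code units in the buffer. An engine restricted to valid input could not be used for this kind of check at all.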


On 10/20/2015 3:06 AM, Philippe Verdy wrote:
2015-10-20 2:07 GMT+02:00 Richard Wordingham <>:
Now, as we know, UTF-32 does not handle the full range of Unicode code

??? All valid UTFs handle the full range of valid Unicode code points. This includes UTF-32 as well as UTF-16 and UTF-8 (and their variants).

it only handles scalar values.

??? UTFs allow encoding ANY valid scalar value (scalar values are bijectively associated with a subset of valid code points). However, they don't allow encoding surrogates (which are valid code points but are not assigned any scalar value, so they are not valid in any valid UTF).

Visibly you are still confusing code points, code units and scalar values.
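
The distinction between code points and scalar values can be sketched in a few lines of Python. The function names below are invented for illustration. The ranges are the ones quoted in this thread: code points run from U+0000 to U+10FFFF, and scalar values are the code points outside the surrogate range U+D800..U+DFFF. The last helper assumes Python's strict codecs as a stand-in for "a valid UTF".

```python
def is_code_point(cp):
    """True for any Unicode code point, surrogates included."""
    return 0 <= cp <= 0x10FFFF

def is_surrogate(cp):
    """True for the surrogate code points U+D800..U+DFFF."""
    return 0xD800 <= cp <= 0xDFFF

def is_scalar_value(cp):
    """True for code points that a valid UTF can encode."""
    return is_code_point(cp) and not is_surrogate(cp)

def encodable_in_utfs(cp):
    """Check the claim against Python's strict UTF-8/16/32 codecs."""
    ch = chr(cp)
    try:
        for codec in ('utf-8', 'utf-16-le', 'utf-32-le'):
            ch.encode(codec)
    except UnicodeEncodeError:
        return False
    return True

for cp in (0x41, 0xD800, 0x10FFFF):
    print(hex(cp), is_code_point(cp), is_scalar_value(cp), encodable_in_utfs(cp))
```

In this sketch U+D800 is a code point but not a scalar value, and all three strict codecs refuse to encode it, UTF-32 included.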

In the discussion of UTS#18
RL1.7, my objections did result in the addition of:

"Note: It is permissible, but not required, to match an isolated
surrogate code point (such as \u{D800}), which may occur in Unicode
Strings. See Unicode String in the Unicode glossary."

I'm not sure that that text loosely associated with RL1.7 gets round
Requirement RL1.1, which still reads:

"To meet this requirement, an implementation shall supply a mechanism
for specifying any Unicode code point (from U+0000 to U+10FFFF), using
the hexadecimal code point representation."

I'm also puzzled about how such a regexp will really match some input text if that input text has to be using a valid UTF. The regexp "\u{D800}" will likely match only lone surrogates (in any UTF), not a surrogate with the same value which is paired correctly to encode a supplementary code point.

Note that even with **valid** UTF-8 text, U+D800 cannot occur. But if you remove the "valid" restriction, U+D800 may be present, including before U+DC00, but this won't form a valid pair: these are also lone surrogates in this case (they are paired and encode a supplementary code point only if the text uses UTF-16).
There are no valid surrogate pairs in valid UTF-8 and valid UTF-32, so if surrogates are appearing, they are all "lone" surrogates. If you blindly convert from UTF-8 or UTF-32 to UTF-16, the invalid text could become valid and new valid supplementary code points will appear unexpectedly. That's why lone surrogates cannot be part of any valid UTF, as they break the bijection.
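
The "blind conversion" hazard described above can be reproduced with a short Python sketch. The function name blind_utf8_to_utf16 is hypothetical, and the use of the "surrogatepass" error handler is an assumption standing in for any converter that does not validate its input. The input bytes are an ill-formed UTF-8 sequence that encodes U+D800 followed by U+DC00 as two separate three-byte sequences.

```python
def blind_utf8_to_utf16(data):
    """Convert without validation, letting surrogates pass through.

    Hypothetical helper; a conforming converter would reject the input.
    """
    text = data.decode('utf-8', errors='surrogatepass')
    return text.encode('utf-16-le', errors='surrogatepass')

# Ill-formed UTF-8: U+D800 and U+DC00, each encoded on its own.
bad_utf8 = b'\xed\xa0\x80\xed\xb0\x80'

# A strict decoder rejects the buffer outright.
try:
    bad_utf8.decode('utf-8')
    strict_ok = True
except UnicodeDecodeError:
    strict_ok = False

# With the restriction lifted, the buffer holds two lone surrogates.
lone = bad_utf8.decode('utf-8', errors='surrogatepass')

# After blind conversion, the same data is a well-formed UTF-16 pair.
utf16 = blind_utf8_to_utf16(bad_utf8)
merged = utf16.decode('utf-16-le')

print(strict_ok, len(lone), len(merged), hex(ord(merged)))
```

Two lone surrogates go in and one supplementary code point (U+10000) comes out, so the conversion is no longer reversible. That is the broken bijection.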

Received on Tue Oct 20 2015 - 06:30:17 CDT
