L2/19-049

Title: Proposed Change in GCB Property Value for Surrogate Code Points

Source: Ken Whistler

Date: January 10, 2019

Action: For consideration by the UTC

Background

In November 2018, there was a discussion on the unicore list regarding
the occurrence of unpaired surrogate code points in the segmentation test
file, GraphemeBreakTest.txt. These pose a problem for some test
implementations, because they cannot validly be converted to UTF-8.
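
The conversion problem is easy to reproduce. The following is a minimal
Python sketch, assuming CPython's built-in strict UTF-8 codec as a stand-in
for the affected test implementations (which were not named in the
discussion). The helper name can_encode_utf8 is hypothetical.

```python
def can_encode_utf8(code_point: int) -> bool:
    """Return True if the single code point encodes to well-formed UTF-8."""
    try:
        # str.encode uses strict error handling by default.
        chr(code_point).encode("utf-8")
        return True
    except UnicodeEncodeError:
        # The strict codec rejects surrogate code points (D800..DFFF).
        return False


print(can_encode_utf8(0xD800))  # isolated surrogate
print(can_encode_utf8(0xFFFD))  # REPLACEMENT CHARACTER
print(can_encode_utf8(0xE000))  # private-use code point
```

Only the surrogate is rejected, which is why a test harness that stores its
input as UTF-8 cannot represent a D800 test case at all.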

Various suggestions were made for making the test cases in that file less
problematic, including replacing isolated surrogate code point values
(D800) with the replacement character (FFFD). However, U+FFFD does not
currently have precisely the same breaking behavior as an isolated
surrogate code point (when testing UTF-16 data strings), so it was then
suggested that the breaking behavior of isolated surrogate code points
could be aligned with that of U+FFFD.

In my opinion, that is not the best direction to take, but there is a fairly
simple way to fix the problem.

Analysis

[quoted from my response in the email thread, November 16, 2018.]

[Aligning properties for isolated surrogates] is more promising, but aligning 
to FFFD is not the best choice, IMO.

The problem for surrogate code points in the segmentation algorithms is that
they are included in gcb=Control, for some reason, rather than being left as
gcb=XX. So you have:

D800: wb=XX, sb=XX, gcb=Control, lb=SG

Whereas for most actual controls, you have the pattern:

200E: wb=Format, sb=Format, gcb=Control, lb=CM

Contrast that with PUA characters:

E000: wb=XX, sb=XX, gcb=XX, lb=XX

I think you would get the best outcome if you simply remove the surrogate
code points from the derivation of gcb=Control, so they would default to
gcb=XX. (Note that in the algorithm, they should then fall through to the
Any ÷ Any rule, and would end up breaking the same.) Then if lb=SG were
treated as lb=XX, rather than just being defined as undefined, any isolated
surrogate code point would, for the purposes of testing, end up just like a
PUA code point for all the segmentation tests. So at that point, you
wouldn't need to actually have D800 values in the test files.

Proposal

Given that analysis, and if no holes are poked in my conclusion during
discussion, I propose the following change for the Grapheme_Cluster_Break
(gcb) property in Unicode 12.0:

Change the assignment of surrogate code points from gcb=Control to gcb=XX
(the default).

Implications

Because GraphemeBreakProperty.txt is generated by the Unicodetools, rather
than being a primary UCD data file, it would require a small (presumably
single-line) change in the tools to remove the range D800..DFFF from the
set of code points assigned gcb=Control.
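
The shape of that change can be sketched as follows. This is not the
Unicodetools code: it is a deliberately simplified Python stand-in that
omits several clauses of the real gcb=Control derivation in UAX #29. The
function name gcb_is_control and the include_surrogates flag are
hypothetical, introduced only to contrast the current and proposed behavior.

```python
import unicodedata

SURROGATES = range(0xD800, 0xE000)  # D800..DFFF


def gcb_is_control(cp: int, include_surrogates: bool) -> bool:
    """Simplified stand-in for deciding whether a code point is gcb=Control.

    include_surrogates=True models the current assignment;
    include_surrogates=False models the proposed one.
    """
    # CR, LF, ZWNJ and ZWJ have their own gcb values.
    if cp in (0x000D, 0x000A, 0x200C, 0x200D):
        return False
    gc = unicodedata.category(chr(cp))
    if gc in ("Zl", "Zp", "Cc", "Cf"):
        return True
    # The proposal amounts to dropping this one clause.
    return include_surrogates and cp in SURROGATES


for cp in (0xD800, 0x200E, 0xE000):
    print(f"{cp:04X}", gcb_is_control(cp, True), gcb_is_control(cp, False))
```

In this sketch only D800 changes between the two columns. 200E stays
Control and E000 stays outside it, matching the property patterns listed in
the analysis.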

The corresponding test data (GraphemeBreakTest.txt) would also need to be
regenerated and checked. Given the way test data is auto-generated by the
tooling, the regeneration would presumably leave no isolated surrogates in
the test data, as their behavior would no longer differ significantly from
that of PUA or reserved code points. However, the maintainers of the
Unicodetools code would need to verify that for the release.
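
One way to carry out that verification is to scan the regenerated file for
any remaining surrogate code points. The following Python sketch assumes the
usual layout of the test file: hexadecimal code points separated by ÷ and ×
marks, with a trailing # comment. The function name and the sample lines are
hypothetical. The samples are illustrative and not copied from
GraphemeBreakTest.txt.

```python
import re
from typing import List

HEX_TOKEN = re.compile(r"^[0-9A-Fa-f]{4,6}$")


def surrogate_code_points(line: str) -> List[int]:
    """Return the surrogate code points (D800..DFFF) named on a test line."""
    data = line.split("#", 1)[0]  # drop the trailing comment
    found = []
    for token in data.split():
        # Break marks (÷, ×) are not hex tokens and are skipped.
        if HEX_TOKEN.match(token):
            cp = int(token, 16)
            if 0xD800 <= cp <= 0xDFFF:
                found.append(cp)
    return found


sample_lines = [
    "÷ 0020 ÷ D800 ÷\t# illustrative line with an isolated surrogate",
    "÷ 0020 × 0308 ÷ E000 ÷\t# illustrative line without one",
]
for line in sample_lines:
    print(surrogate_code_points(line))
```

An empty result for every line of the regenerated file would confirm that no
isolated surrogates remain in the test data.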

The implications for line breaking are a little different. I am not
advocating any change to the Line_Break property value for isolated
surrogates, but some small updates to the text in UAX #14 might be useful.
Those changes are not urgent for 12.0, so they could await some future
update for UAX #14.