L2/19-049

Title: Proposed Change in GCB Property Value for Surrogate Code Points
Source: Ken Whistler
Date: January 10, 2019
Action: For consideration by the UTC

Background

Back in November, 2018, there was a discussion on the unicore list regarding the occurrence of unpaired surrogate code points in the segmentation test file, GraphemeBreakTest.txt. These pose a problem for some test implementations, because they cannot validly be converted to UTF-8. Various suggestions were made as to how to fix this to be less problematical in the test cases in that file, including the replacement of isolated surrogate code point values (D800) with the replacement character (FFFD).

However, U+FFFD does not currently have precisely the same breaking behavior as an isolated surrogate code point (when testing UTF-16 data strings), so the discussion then suggested that perhaps the breaking behavior of isolated surrogate code points could be converged to that of U+FFFD. In my opinion, that is not the best direction to take, but there is a fairly simple way to fix the problem.

Analysis

[quoted from my response in the email thread, November 16, 2018.]

[Aligning properties for isolated surrogates] is more promising, but aligning to FFFD is not the best choice, IMO. The problem for surrogate code points in the segmentation algorithms is that they are included in gcb=Control, for some reason, rather than being left as gcb=XX. So you have:

D800: wb=XX, sb=XX, gcb=Control, lb=SG

Whereas for most actual controls, you have the pattern:

200E: wb=Format, sb=Format, gcb=Control, lb=CM

Contrast that with PUA characters:

E000: wb=XX, sb=XX, gcb=XX, lb=XX

I think you would get the best outcome if you simply remove the surrogate code points from the derivation of gcb=Control, so they would default to gcb=XX. (Note that in the algorithm, they should then fall through to the Any ÷ Any rule, and would end up breaking the same.)

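The fall-through noted above can be illustrated with a minimal Python sketch. It is a hypothetical, simplified model and not the UAX #29 algorithm: it knows only the values gcb=Control and gcb=XX and only three rules (GB4, GB5, and GB999, the Any ÷ Any rule), each of which produces a break. Rules that depend on other property values (for example GB9) are not modeled, so the sketch says nothing about those contexts. The function and table names are invented for illustration.

```python
# Hypothetical, simplified model: only gcb=Control and gcb=XX are handled,
# and only rules GB4, GB5 and GB999 of UAX #29 are represented.
# Each of these three rules results in a break (÷).

def rule_for_boundary(left: str, right: str) -> str:
    """Name the first modeled rule that applies between two gcb values."""
    if left == "Control":
        return "GB4"    # Control ÷
    if right == "Control":
        return "GB5"    # ÷ Control
    return "GB999"      # Any ÷ Any

# gcb values as quoted in the analysis; "XX" is the default value.
current = {0xD800: "Control", 0xE000: "XX"}
proposed = {0xD800: "XX", 0xE000: "XX"}

for label, table in (("current", current), ("proposed", proposed)):
    before = rule_for_boundary(table[0xE000], table[0xD800])  # E000, D800
    after = rule_for_boundary(table[0xD800], table[0xE000])   # D800, E000
    print(label, before, after)
# prints:
# current GB5 GB4
# proposed GB999 GB999
```

In this restricted model the boundary next to D800 is a break under either assignment; only the rule that produces it changes.
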
Then if lb=SG were treated as lb=XX, rather than just being defined as undefined, any isolated surrogate code point would, for the purposes of testing, end up just like a PUA code point for all the segmentation tests. So at that point, you wouldn't need to actually have D800 values in the test files.

Proposal

Given that analysis, and if no holes are poked in my conclusion during discussion, I propose the following change for the Grapheme_Cluster_Break (gcb) property in Unicode 12.0:

Change the assignment of surrogate code points from gcb=Control to gcb=XX (the default).

Implications

Because GraphemeBreakProperty.txt is generated by the Unicodetools, rather than being a primary UCD data file, it would require a small (presumably single-line) change in the tools to remove the range D800..DFFF from the list which is assigned gcb=Control.

The corresponding test data (GraphemeBreakTest.txt) would also need to be generated and checked. Given the way test data is auto-generated by the tooling, the regeneration would presumably update the test data so that no isolated surrogates would occur in the test data, as their testing would no longer be significantly different from PUA or reserved code points. However, the maintainers of the Unicodetools code would need to verify that for the release.

The implications for linebreaking are a little different. I am not advocating any change to the Line_Break property value for isolated surrogates, but some small updates to the text in UAX #14 might be useful. Those changes are not urgent for 12.0, so they could await some future update for UAX #14.
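
As a rough illustration of the tooling change described under Implications, the following Python sketch removes the range D800..DFFF from a list of ranges assigned gcb=Control. It is not the Unicodetools code: the names `CONTROL_RANGES`, `toy_gcb` and `utf8_encodable` are hypothetical, the range list is a small illustrative subset, and the lookup distinguishes only Control from the default XX. The last function shows the kind of check a test harness might use to confirm that a test string can be converted to UTF-8.

```python
# Hypothetical stand-in for the list of ranges assigned gcb=Control.
# Only a few illustrative ranges are listed; this is not the full derivation.
CONTROL_RANGES = [
    (0x0000, 0x0009),
    (0x200E, 0x200F),
    (0xD800, 0xDFFF),  # surrogate code points (current assignment)
]

SURROGATES = (0xD800, 0xDFFF)

def toy_gcb(cp: int, control_ranges) -> str:
    """Return "Control" if cp is in a listed range, else the default "XX"."""
    for first, last in control_ranges:
        if first <= cp <= last:
            return "Control"
    return "XX"

# The proposed change: drop the surrogate range from the list.
proposed_ranges = [r for r in CONTROL_RANGES if r != SURROGATES]

def utf8_encodable(text: str) -> bool:
    """True if text can be converted to UTF-8 (no isolated surrogates)."""
    try:
        text.encode("utf-8")
        return True
    except UnicodeEncodeError:
        return False

print(toy_gcb(0xD800, CONTROL_RANGES))   # Control
print(toy_gcb(0xD800, proposed_ranges))  # XX
print(toy_gcb(0x200E, proposed_ranges))  # Control
print(utf8_encodable(chr(0xD800)))       # False
print(utf8_encodable(chr(0xE000)))       # True
```

The encodability check reflects why a PUA code point such as E000 is a more convenient test value than D800 once both carry the same property values.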