On Sun, 4 Oct 2015 10:50:43 -0700 Markus Scherer <markus.icu@gmail.com> wrote:

> I would not spend any time specifying intricate rules for unpaired surrogates in 16-bit strings, or out-of range values in 32-bit strings. Most processing will treat them like unassigned characters, like U+50005, with only default behaviors.

The core problem here is that many editors will not allow one to delete just a non-initial character from a grapheme cluster. I fear there may be editors that don't even allow one to delete the final character. This may not be a problem when one works with a small set of grapheme clusters, as in French or German, or possibly even Vietnamese, but becomes a problem when working with such a large set that the notion of them being user-perceived characters strains credulity.
A stray U+50005 before a combining mark would also be fiddly to get rid of, but even if the editor does not allow the entry of arbitrary scalar values, a user might fix the problem by creating an HTML file containing the character and then copying the character from the HTML file to a find and replace command. This trick is unlikely to work for a lone surrogate.
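
As a rough illustration of the difference, here is a minimal Python sketch. It assumes the HTML file carries the character as a numeric character reference, and the names STRAY, LONE and can_encode_utf8 are invented for the example. It suggests why the copy-and-paste route can work for U+50005 but tends to fail for a lone surrogate: the former is a scalar value that survives HTML decoding and UTF-8 encoding, the latter is not.

```python
import html
import unicodedata

# Hypothetical names, chosen only for this illustration.
STRAY = "\U00050005"   # unassigned code point, but still a Unicode scalar value
LONE = "\ud800"        # lone high surrogate, not a scalar value

# General categories: Cn (unassigned) versus Cs (surrogate).
stray_category = unicodedata.category(STRAY)
lone_category = unicodedata.category(LONE)

# A stray U+50005 sitting before a combining acute accent.
text = "a" + STRAY + "\u0301b"

# Assumed form of the HTML trick: a numeric character reference
# decodes to the character itself, which can then be copied.
from_html = html.unescape("&#x50005;")

# With the character in hand, a plain find and replace removes it.
cleaned = text.replace(from_html, "")

# The same reference for a surrogate decodes to U+FFFD instead,
# so there is nothing useful to copy into the search field.
surrogate_from_html = html.unescape("&#xD800;")


def can_encode_utf8(s: str) -> bool:
    """Return True if s can be written out as well-formed UTF-8."""
    try:
        s.encode("utf-8")
        return True
    except UnicodeEncodeError:
        return False
```

In this sketch can_encode_utf8(STRAY) is true and can_encode_utf8(LONE) is false, so the lone surrogate could not even be stored in a UTF-8 HTML file in the first place. How a particular editor's search field treats either value is a separate question that the sketch does not answer.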