Re: Deleting Lone Surrogates

From: Richard Wordingham <richard.wordingham_at_ntlworld.com>
Date: Sun, 4 Oct 2015 20:38:02 +0100

On Sun, 4 Oct 2015 10:50:43 -0700
Markus Scherer <markus.icu_at_gmail.com> wrote:

> I would not spend any time specifying intricate rules for unpaired
> surrogates in 16-bit strings, or out-of-range values in 32-bit
> strings. Most processing will treat them like unassigned characters,
> like U+50005, with only default behaviors.

The core problem here is that many editors will not allow one to delete
just a non-initial character from a grapheme cluster. I fear there may
be editors that don't even allow one to delete the final character.
This may not be a problem when one works with a small set of grapheme
clusters, as in French or German, or possibly even Vietnamese, but
becomes a problem when working with such a large set that the notion of
them being user-perceived characters strains credulity.

A stray U+50005 before a combining mark would also be fiddly to get
rid of, but even if the editor does not allow the entry of arbitrary
scalar values, a user might fix the problem by creating an HTML file
containing the character and then copying the character from the HTML
file to a find-and-replace command. This trick is unlikely to work for
a lone surrogate.
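
When the editor offers no way to remove a lone surrogate, one fallback is to
filter the text outside the editor. What follows is only a minimal sketch,
assuming the text is available as a list of 16-bit code units; the function
name strip_lone_surrogates is hypothetical and does not come from any library.

```python
def strip_lone_surrogates(units):
    """Return a copy of a UTF-16 code unit list with unpaired surrogates removed."""
    out = []
    i = 0
    n = len(units)
    while i < n:
        u = units[i]
        if 0xD800 <= u <= 0xDBFF:
            # High surrogate: keep it only when a low surrogate follows directly.
            if i + 1 < n and 0xDC00 <= units[i + 1] <= 0xDFFF:
                out.extend(units[i:i + 2])
                i += 2
                continue
            i += 1  # lone high surrogate, dropped
        elif 0xDC00 <= u <= 0xDFFF:
            i += 1  # lone low surrogate, dropped
        else:
            out.append(u)  # ordinary BMP code unit, kept as is
            i += 1
    return out


# A stray high surrogate between a base letter and a combining acute accent.
cleaned = strip_lone_surrogates([0x0061, 0xD800, 0x0301])
```

Here cleaned holds only the base letter and the combining mark. Well-formed
surrogate pairs pass through untouched, so supplementary characters such as
U+50005 are not affected by the filter.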

Richard.
Received on Sun Oct 04 2015 - 14:39:09 CDT