Re: Deleting Lone Surrogates

From: Asmus Freytag (t) <>
Date: Sun, 4 Oct 2015 14:29:16 -0700
On 10/4/2015 12:38 PM, Richard Wordingham wrote:
On Sun, 4 Oct 2015 10:50:43 -0700
Markus Scherer <> wrote:

I would not spend any time specifying intricate rules for unpaired
surrogates in 16-bit strings, or out-of-range values in 32-bit
strings. Most processing will treat them like unassigned characters,
like U+50005, with only default behaviors.
The core problem here is that many editors will not allow one to delete
just a non-initial character from a grapheme cluster.  I fear there may
be editors that don't even allow one to delete the final character.
This may not be a problem when one works with a small set of grapheme
clusters, as in French or German, or possibly even Vietnamese, but
becomes a problem when working with such a large set that the notion of
them being user-perceived characters strains credulity.

The problem you are trying to solve is to allow editing on the code point level, or, if you will, the keystroke level. Generally, there will be a sweet spot for each language (and each user) with respect to what to erase or undo.

For sequences that belong to a given language, you can pick the behavior that makes most sense in them, but for lone surrogates, by definition you are dealing with broken text that doesn't follow any conventions.

Lone surrogates should also be something that doesn't occur commonly. So, for all of those reasons, I see no particular problem with giving them a "generic" behavior, such as deleting the entire combining sequence, especially if your interface normally deletes sequences as a unit.

If it never treats sequences as units, then I would in fact question why this should be different for surrogates.
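
To make the "generic" behavior concrete, here is a minimal sketch of a backspace that deletes a whole combining sequence (one base plus any trailing marks) and gives a lone surrogate no special treatment, so it is handled like any other base. The function names are my own for illustration, and using the General_Category "M*" test as the definition of a mark is a simplifying assumption; a real editor would more likely follow full grapheme cluster rules.

```python
import unicodedata


def is_mark(ch: str) -> bool:
    # Assumption: "combining mark" means General_Category Mn, Mc or Me.
    return unicodedata.category(ch).startswith("M")


def backspace_sequence(text: str) -> str:
    """Delete the last combining sequence: trailing marks plus one base.

    A lone surrogate (category Cs) is not a mark, so it is treated as a
    base like any assigned or unassigned character.
    """
    i = len(text)
    # Step back over any trailing combining marks.
    while i > 0 and is_mark(text[i - 1]):
        i -= 1
    # Step back over the base, if there is one.
    if i > 0:
        i -= 1
    return text[:i]
```

With this sketch, a lone surrogate followed by a combining acute is removed in one keystroke, exactly as "e" plus combining acute would be.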

But in any case, the minimal requirement on an editor is that it lets you delete (and then retype) enough text to get it back to an uncorrupted state.
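
As a rough illustration of what "corrupted" means here, the following sketch scans a sequence of 16-bit code units and reports the positions of unpaired surrogates. The function name and the choice of a plain list of integers as the text buffer are assumptions made for the example, not a description of any particular editor.

```python
def lone_surrogate_indices(units):
    """Return the indices of unpaired surrogates in a list of UTF-16 code units."""
    lone = []
    i = 0
    n = len(units)
    while i < n:
        u = units[i]
        if 0xD800 <= u <= 0xDBFF:
            # High surrogate: well-formed only if a low surrogate follows.
            if i + 1 < n and 0xDC00 <= units[i + 1] <= 0xDFFF:
                i += 2
                continue
            lone.append(i)
        elif 0xDC00 <= u <= 0xDFFF:
            # Low surrogate with no preceding high surrogate.
            lone.append(i)
        i += 1
    return lone
```

The text is back in an uncorrupted state once this returns an empty list, so whatever unit the editor deletes by, it has to be able to cover every index reported here.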

A stray U+50005 before a combining mark would also be fiddly to get
rid of, but even if the editor does not allow the entry of arbitrary
scalar values, a user might fix the problem by creating an HTML file
containing the character and then copying the character from the HTML
file to a find and replace command.  This trick is unlikely to work for
a lone surrogate.

Catch-22 here. In filtering input to the dialog to prevent it from being used to corrupt text, you prevent it from being used to repair text. Interesting.
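
One conceivable way around the catch-22, offered only as a sketch: let the search field accept an escaped notation, so the dialog itself never has to hold a raw lone surrogate typed or pasted by the user. The `\uXXXX` syntax and both function names below are hypothetical; I am not describing any existing editor's dialog.

```python
import re

# Hypothetical escape syntax for the search field: \u followed by 4 hex digits.
_ESCAPE = re.compile(r"\\u([0-9A-Fa-f]{4})")


def parse_search_field(raw: str) -> str:
    """Expand \\uXXXX escapes, including values in the surrogate range."""
    return _ESCAPE.sub(lambda m: chr(int(m.group(1), 16)), raw)


def find_and_replace(text: str, raw_pattern: str, replacement: str) -> str:
    """Replace every occurrence of the (escaped) pattern in text."""
    return text.replace(parse_search_field(raw_pattern), replacement)
```

A user could then enter `\uD800` as the search term with an empty replacement to strip that stray code unit. The replacement field can still be filtered, so the dialog can be used to repair text without being usable to corrupt it.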

Received on Sun Oct 04 2015 - 16:30:25 CDT
