Re: Deleting Lone Surrogates

From: Philippe Verdy <verdy_p_at_wanadoo.fr>
Date: Sun, 4 Oct 2015 15:56:42 +0200

IMHO, isolated surrogates are not valid starters for combining sequences;
they must remain isolated: deleting the surrogate in your text editor
should not delete the following combining mark, which is a separate cluster
(even if that cluster is defective before the deletion, as it has NO base
starter).
For default grapheme clusters, it would be helpful to add a rule to force a
cluster break before and after any lone surrogate (i.e. for grapheme cluster
breaking, treat any lone surrogate as if it were a control like NUL U+0000).
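
To illustrate the rule proposed above, here is a rough Python sketch. It is
not the UAX #29 algorithm: it only models "a combining mark attaches to the
preceding cluster", approximates combining marks by general category M*, and
assumes the input is already a list of code points, so any value in
D800..DFFF is a lone surrogate. The names is_lone_surrogate, is_mark and
clusters are made up for this example.

```python
import unicodedata

def is_lone_surrogate(cp):
    # Input is a list of decoded code points, so any surrogate value
    # that survives decoding is necessarily unpaired.
    return 0xD800 <= cp <= 0xDFFF

def is_mark(cp):
    # Assumption: approximate "combining mark" by general category Mn/Mc/Me.
    return unicodedata.category(chr(cp)).startswith("M")

def clusters(cps):
    """Group code points into simplified clusters, isolating lone surrogates."""
    result = []
    for cp in cps:
        attach = (
            bool(result)
            and is_mark(cp)
            # Proposed rule: never attach a mark to a lone surrogate.
            and not is_lone_surrogate(result[-1][-1])
        )
        if attach:
            result[-1].append(cp)
        else:
            # A lone surrogate is not a mark, so it always starts a cluster.
            result.append([cp])
    return result

# <U+1148F, U+D805, U+114BA> splits into three clusters; the candrabindu
# stays a separate (defective) cluster instead of joining the surrogate.
print(clusters([0x1148F, 0xD805, 0x114BA]))
```

With this behaviour, deleting the cluster that holds U+D805 leaves
<U+1148F, U+114BA>, and the mark then attaches to its intended base.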

2015-10-04 15:02 GMT+02:00 Richard Wordingham <
richard.wordingham_at_ntlworld.com>:

> In the absence of a specific tailoring, is the combination of a lone
> surrogate and a combining mark a user-perceived character? Does a lone
> surrogate constitute a user-perceived character?
>
> The problem I have is that because of an application-specific bug,
> when I attempt to enter the sequence <U+1148F TIRHUTA LETTER KA,
> U+114BA TIRHUTA SIGN CANDRABINDU>, I appear to be getting the UTF-16 code
> unit sequence <D805 DC8F D805 D805 DCBA>, which is being interpreted as
> the codepoint sequence <U+1148F, U+D805, U+114BA>.
>
> (The problem seems to arise because I use a sequence of two key strokes
> to enter candrabindu, and the application or input mechanism has to undo
> the entry of a supplementary character entered in response to the first
> keystroke. I've reported the problem as Bug 94753.)
>
> Because the lone surrogate is interpreted as the start of a
> user-perceived character, I can move the cursor to between U+1148F and
> U+D805. Then pressing the 'delete' key (as opposed to the 'rubout'
> key) will delete the U+D805. However, if the lone surrogate plus
> combining mark is a user-perceived character, then all I will be left
> with is <U+1148F>. At present the offending application is treating
> Tirhuta combining marks as user-perceived characters, but I suspect the
> application has simply not caught up with Unicode Version 7 yet.
>
> Richard.
>
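
For reference, the interpretation of the code unit sequence quoted above can
be reproduced with a small Python sketch. It assumes a lenient decoder that
pairs a high surrogate with an immediately following low surrogate and passes
anything else through unchanged; the function name decode_utf16_units is
invented for this example, and a real application may handle ill-formed input
differently (for instance by substituting U+FFFD).

```python
def decode_utf16_units(units):
    """Pair surrogates where possible; keep unpaired surrogates as-is."""
    out = []
    i = 0
    while i < len(units):
        u = units[i]
        has_next = i + 1 < len(units)
        if 0xD800 <= u <= 0xDBFF and has_next and 0xDC00 <= units[i + 1] <= 0xDFFF:
            # Well-formed pair: combine into one supplementary code point.
            out.append(0x10000 + ((u - 0xD800) << 10) + (units[i + 1] - 0xDC00))
            i += 2
        else:
            # BMP code unit or lone surrogate: emitted unchanged.
            out.append(u)
            i += 1
    return out

units = [0xD805, 0xDC8F, 0xD805, 0xD805, 0xDCBA]
print(["U+%04X" % cp for cp in decode_utf16_units(units)])
# prints ['U+1148F', 'U+D805', 'U+114BA']
```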
Received on Sun Oct 04 2015 - 08:58:00 CDT
