Re: Deleting Lone Surrogates from Philippe Verdy on 2015-10-04 (Unicode Mail List Archive)

From: Philippe Verdy <verdy_p_at_wanadoo.fr>
Date: Sun, 4 Oct 2015 21:48:12 +0200

2015-10-04 21:30 GMT+02:00 Richard Wordingham <richard.wordingham_at_ntlworld.com>:

> On Sun, 4 Oct 2015 15:44:32 +0200
> Mark Davis ☕️ <mark_at_macchiato.com> wrote:
>
> > When I use http://unicode.org/cldr/utility/breaks.jsp, it does show
> > the sequence 𑒏�𑒺 as just two grapheme clusters.
>
> But that's the sequence <U+1148F, U+FFFD, U+114BA>, which has no lone
> surrogates at all! (I had to look at the raw email file to be sure of
> what the text was - my email client displays U+FFFD and malformed
> alleged UTF-8 the same.)

Mark only said that this was what the tool showed, i.e. the lone surrogate got
treated as U+FFFD.
However, 𑒏�𑒺 (using U+FFFD substitution) gives 2 grapheme clusters, and in my
opinion a solution that gives 3 grapheme clusters would be preferable: the
lone surrogate would act as if it were a line-break control, so that the
third character (combining, and placed just after the lone surrogate) would
not combine with it but would be handled as a defective combining sequence
with no starter at all before it.
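
To make the difference between the two outcomes concrete, here is a minimal
Python sketch. It is not an implementation of the UAX #29 rules: the function
name toy_clusters and its single rule (a combining mark attaches to the
preceding character unless that character is a lone surrogate) are
assumptions made only for illustration.

```python
import unicodedata


def is_lone_surrogate(ch):
    # In a Python str, an unpaired surrogate is a single code point
    # in the range U+D800..U+DFFF.
    return 0xD800 <= ord(ch) <= 0xDFFF


def toy_clusters(text):
    """Toy segmentation (hypothetical helper, NOT the full UAX #29 algorithm).

    Assumed rule: a combining mark (general category Mn, Mc or Me) joins the
    previous cluster, except when there is no previous cluster or when the
    previous character is a lone surrogate, which acts as a hard boundary.
    """
    clusters = []
    for ch in text:
        combining = unicodedata.category(ch) in ("Mn", "Mc", "Me")
        blocked = not clusters or is_lone_surrogate(clusters[-1][-1])
        if combining and not blocked:
            clusters[-1] += ch
        else:
            clusters.append(ch)
    return clusters


# <U+1148F, U+FFFD, U+114BA>: the mark attaches to the replacement character.
substituted = "\U0001148F\uFFFD\U000114BA"
# <U+1148F, lone U+D800, U+114BA>: the mark is left without a starter.
with_surrogate = "\U0001148F\uD800\U000114BA"

print(len(toy_clusters(substituted)))
print(len(toy_clusters(with_surrogate)))
```

Under these assumptions the substituted text yields two clusters, while the
text that keeps the lone surrogate yields three, the last one being the
combining mark on its own.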
Received on Sun Oct 04 2015 - 14:49:27 CDT
