Re: Counting Codepoints

From: Richard Wordingham <richard.wordingham_at_ntlworld.com>
Date: Tue, 13 Oct 2015 07:36:30 +0100

On Tue, 13 Oct 2015 00:49:29 +0200
Philippe Verdy <verdy_p_at_wanadoo.fr> wrote:

> 2015-10-12 21:38 GMT+02:00 Richard Wordingham
> <richard.wordingham_at_ntlworld.com>:
> > Graceful fallback is exactly where the issue arises. Throwing an
> > exception is not a useful answer to the question of how many code
> > points a 'Unicode string' (not a 'UTF-16 string') contains.

> If you get an invalid UTF-16 string and catch an exception, this is
> a sign that it is not UTF-16 but, very frequently, something else. The
> application may want to retry with another encoding, possibly using
> heuristic guessers, but the heuristic will only give a *probable
> answer*.

On Mon, 12 Oct 2015 23:35:32 +0000
David Starner <prosfilaes_at_gmail.com> wrote:

> Thus a Unicode string simply can't be in UTF-16 format
> internally with unpaired surrogates; a Unicode string in a
> programmer-opaque format must do something with broken data on input.

You're assuming that the source of the non-conformance is external to
the program. In the case that has caused me to ask about lone
surrogates, they were actually caused by a faulty character deletion
function within the program itself. Despite this fault, the program
remains usable - it's little worse than a word processor that insists on
autocorrupting 'GHz' and 'MHz' to 'Ghz' and 'Mhz'.
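
As an illustration only (this is not the faulty function from the
program in question, which I have not reproduced here), a deletion
routine that avoids creating lone surrogates might look like the sketch
below. It assumes the string is held as a list of 16-bit code units and
that 0 <= pos <= len(units); the name delete_before is hypothetical.

```python
def delete_before(units, pos):
    """Delete the code point ending just before index pos.

    units is a sequence of 16-bit code units (ints); this representation
    is an assumption made for the sketch. Returns (new_units, new_pos).
    """
    if pos <= 0:
        return list(units), 0
    start = pos - 1
    # If the unit being deleted is a low surrogate preceded by a high
    # surrogate, remove both so that no lone surrogate is left behind.
    if (start > 0
            and 0xDC00 <= units[start] <= 0xDFFF
            and 0xD800 <= units[start - 1] <= 0xDBFF):
        start -= 1
    return list(units[:start]) + list(units[pos:]), start
```

A deletion function that always removes exactly one code unit would
instead leave the high surrogate of a pair behind.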

I presume you are expecting input of fractional characters to be
buffered until there is a whole character to add to a string. For
example, an MSKLC keyboard will deliver a supplementary character in
two WM_CHAR messages, one for the high surrogate and one for the low
surrogate.
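
A minimal sketch of such buffering, under stated assumptions: input
arrives one 16-bit code unit at a time (as the two WM_CHAR messages
would deliver it), and a surrogate that turns out to be unpaired is
passed through unchanged. That pass-through policy is my choice for the
sketch, not something the standard mandates, and the class name
SurrogateBuffer is hypothetical.

```python
class SurrogateBuffer:
    """Hold back a high surrogate until the next code unit arrives."""

    def __init__(self):
        self.pending_high = None  # high surrogate awaiting its partner

    def feed(self, unit):
        """Accept one 16-bit code unit; return the code points now ready."""
        out = []
        if self.pending_high is not None:
            high, self.pending_high = self.pending_high, None
            if 0xDC00 <= unit <= 0xDFFF:
                # Complete pair: combine into one supplementary code point.
                out.append(0x10000 + ((high - 0xD800) << 10) + (unit - 0xDC00))
                return out
            # Partner never came: emit the lone high surrogate as-is
            # (policy assumption for this sketch).
            out.append(high)
        if 0xD800 <= unit <= 0xDBFF:
            self.pending_high = unit  # wait for a possible low surrogate
        else:
            out.append(unit)  # BMP character or lone low surrogate
        return out
```

Note that a high surrogate still pending when input ends is never
emitted by this sketch; a real program would need to flush it or
discard it.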

Returning to the original questions, it would seem that there is not a
unique answer to the question of how many codepoints a Unicode 16-bit
string contains. Rather, the question must be the unwieldy one of how
many scalar values and lone surrogates it contains in total.
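
To make that count concrete, here is a small sketch. It assumes the
string is given as a sequence of 16-bit code units (ints) and treats
every high surrogate immediately followed by a low surrogate as one
scalar value; any other surrogate is counted as lone. The function name
is hypothetical.

```python
def count_code_points(units):
    """Return (scalar_values, lone_surrogates) for 16-bit code units."""
    scalars = 0
    lone = 0
    i = 0
    n = len(units)
    while i < n:
        u = units[i]
        if (0xD800 <= u <= 0xDBFF and i + 1 < n
                and 0xDC00 <= units[i + 1] <= 0xDFFF):
            scalars += 1  # well-formed surrogate pair: one scalar value
            i += 2
        elif 0xD800 <= u <= 0xDFFF:
            lone += 1     # unpaired high or low surrogate
            i += 1
        else:
            scalars += 1  # BMP scalar value
            i += 1
    return scalars, lone
```

The sum of the two numbers is then the total the question asks for.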

Richard.
Received on Tue Oct 13 2015 - 01:37:51 CDT
