Re: Counting Codepoints

From: Philippe Verdy <verdy_p_at_wanadoo.fr>
Date: Tue, 13 Oct 2015 00:49:29 +0200

2015-10-12 21:38 GMT+02:00 Richard Wordingham <
richard.wordingham_at_ntlworld.com>:

> On Sun, 11 Oct 2015 21:36:49 -0700
> Ken Whistler <kenwhistler_at_att.net> wrote:
>
> > I think the correct answer is probably:
> >
> > (c) The ill-formed three code unit Unicode 16-bit string
> > <0xDC00, 0xD800, 0xDC20> contains one code point, U+10020, and
> > one uninterpreted (and uninterpretable) low surrogate
> > code unit 0xDC00.
> >
> > In other words, I don't think it is useful or helpful to map isolated,
> > uninterpretable surrogate code units *to* surrogate code points.
> > Surrogate code points are an artifact of the code architecture. They
> > are code points in the code space which *cannot* be represented
> > in UTF-16, by definition.
> >
> > Any discussion about properties for surrogate code points is a
> > matter of designing graceful API fallback for instances which
> > have to deal with ill-formed strings and do *something*. I don't
> > think that should extend to treating isolated surrogate code
> > units as having interpretable status, *as if* they were valid
> > code points represented in the string.
>
> Graceful fallback is exactly where the issue arises. Throwing an
> exception is not a useful answer to the question of how many code
> points a 'Unicode string' (not a 'UTF-16 string') contains.
>

It really is a **useful** answer, because there is actually no correct
answer unless you assume some (not clearly defined) sanitization process:
removing part of the text means you give an answer about a different text;
substitution is also not clearly defined; you could even discard everything
after the first error encountered.

If you get an invalid UTF-16 string and catch an exception, that is a
sign the input is not UTF-16, and very frequently it is something else. The
application may want to retry with another encoding, possibly using
heuristic guessers, but a heuristic will only give a *probable* answer.
If that probable answer is still UTF-16, the application may or may not
want to alter the input text and instruct the function to perform a
specific "sanitization", but this process is NOT defined in the UTF-16
specification itself; the result will be a local-only decision, which may
not match what other systems do (other systems may fall back to an
encoding that produces no error at all, such as ISO 8859-1, or to a default
system encoding such as CP437). But as this will frequently produce
"mojibake", it is best to notice the error, log it for later manual
processing (if needed), and discard the text completely as invalid (the
standard behavior of conforming applications for UTF-16).
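For illustration only, Python's built-in UTF-16 decoder happens to expose both behaviors contrasted above — conforming rejection versus a local-only substitution (the byte layout and the `errors` handler names are Python specifics, not anything the UTF-16 specification defines):

```python
# b"\xdc\x00" is an unpaired low surrogate in UTF-16-BE, followed by
# a valid surrogate pair <0xD800, 0xDC20> encoding U+10020.
data = b"\xdc\x00\xd8\x00\xdc\x20"

try:
    data.decode("utf-16-be")  # strict: the conforming rejection
except UnicodeDecodeError as err:
    print("invalid UTF-16:", err.reason)

# A local-only fallback: substitute U+FFFD for the bad unit and continue.
# Another system might instead guess ISO 8859-1 and silently produce mojibake.
repaired = data.decode("utf-16-be", errors="replace")
print(repaired)  # U+FFFD followed by U+10020
```

The strict path raises, which is the behavior defended above; the `replace` path yields a string of two code points — an answer, but one about a different text than the input.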

Any sanitization will be error-prone, as it will always be a heuristic.
Users should get some visible notification that the input was invalid, and
the "correction" should not be automated unless the users explicitly ask
for it and the application offers a choice of options, the minimum being
that the application offers the user a visual inspection of each option.
But then we are completely outside the scope of the UTF-16 standard itself.
Received on Mon Oct 12 2015 - 17:51:34 CDT

This archive was generated by hypermail 2.2.0 : Mon Oct 12 2015 - 17:51:34 CDT