Re: Counting Codepoints

From: Mark Davis ☕️ <mark_at_macchiato.com>
Date: Mon, 12 Oct 2015 14:42:57 +0200

I agree with Ken on "Any discussion about properties for surrogate code
points is a matter of designing graceful API fallback for instances which have
to deal with ill-formed strings and do *something*.", and here is my
advice based on that.

You want the code point count to reflect the same count that you would get
if you were to "sanitize" the string by fixing the isolated surrogates when
converting from a 16-bit Unicode String to valid UTF-16. Sanitizing should
*never* involve deletion (for security reasons). The best practice is to
replace them with U+FFFD, following the guidelines in TUS Chapter 3,
Constraints on Conversion Processes.
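
For concreteness, here is a rough sketch of that kind of sanitizing pass,
written in Java (the method name is just illustrative, not from any
particular library):

    // Sketch: replace each unpaired surrogate code unit with U+FFFD,
    // never deleting anything, per the TUS Chapter 3 guidelines.
    static String sanitizeUtf16(CharSequence s) {
        StringBuilder out = new StringBuilder(s.length());
        for (int i = 0; i < s.length(); i++) {
            char c = s.charAt(i);
            if (Character.isHighSurrogate(c)
                    && i + 1 < s.length()
                    && Character.isLowSurrogate(s.charAt(i + 1))) {
                out.append(c).append(s.charAt(i + 1));  // well-formed pair
                i++;
            } else if (Character.isSurrogate(c)) {
                out.append('\uFFFD');  // isolate: substitute, don't delete
            } else {
                out.append(c);
            }
        }
        return out.toString();
    }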

And you want it to reflect the same code point count that you would get from
common APIs that traverse 16-bit Unicode Strings. I don't know of any
code point iterators that just *skip* the isolates; they are typically
returned as single code points.
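
For example, a plain code point loop in Java hands the isolate back as a
single value (my example, using the 16-bit string from Richard's message
below):

    // The isolated 0xDC00 comes back as its own "code point" value,
    // followed by U+10020 for the well-formed pair.
    "\uDC00\uD800\uDC20".codePoints()
        .forEach(cp -> System.out.printf("U+%04X%n", cp));  // U+DC00, U+10020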

If these are not all aligned, then all heck breaks loose: you are letting
yourself in for code breakage and/or security problems.

So the corresponding code point count would just return a count of 1 for an
isolated surrogate.
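
That is also what you already get from, e.g., Java's codePointCount (my own
check, not anything normative):

    // Richard's ill-formed 16-bit string <0xDC00, 0xD800, 0xDC20>:
    // one isolated low surrogate + one well-formed pair = 2 code points.
    String s = "\uDC00\uD800\uDC20";
    int n = s.codePointCount(0, s.length());  // 2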

UTF-8 is gummier. I'd count according to whatever the standard practice for
"sanitizing" output is in the programming environment. That could be the
"maximal subpart" approach in TUS Ch. 3, or it could be an alternative
approach: consistency with the approach already in use is the most important
feature.
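
As a purely illustrative sketch in Java (the exact number of U+FFFDs you end
up with depends on the decoder's substitution policy, which is exactly why
consistency matters):

    import java.nio.charset.StandardCharsets;

    // Ken's ill-formed UTF-8 example below: <0x61, 0xF4, 0x90, 0x90, 0x90, 0x61>.
    byte[] bytes = { 0x61, (byte) 0xF4, (byte) 0x90, (byte) 0x90, (byte) 0x90, 0x61 };

    // The String(byte[], Charset) constructor substitutes U+FFFD for malformed
    // input rather than throwing; how many U+FFFDs replace the four ill-formed
    // bytes depends on the decoder (one per maximal subpart vs. one per byte).
    String decoded = new String(bytes, StandardCharsets.UTF_8);
    int count = decoded.codePointCount(0, decoded.length());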

Mark <https://google.com/+MarkDavis>

*— The best is the enemy of the good —*

On Mon, Oct 12, 2015 at 6:36 AM, Ken Whistler <kenwhistler_at_att.net> wrote:

>
>
> On 10/11/2015 2:20 PM, Richard Wordingham wrote:
>
>> Is the number of codepoints in a UTF-16 string well defined?
>>
>> For example, which of the following two statements are true?
>>
>> (a) The ill-formed three code-unit Unicode 16-bit string <0xDC00,
>> 0xD800, 0xDC20> contains two codepoints, U+DC00 and U+10020.
>>
>> (b) The ill-formed three code-unit Unicode 16-bit string <0xDC00,
>> 0xD800, 0xDC20> contains three codepoints, U+DC00, U+D800 and U+DC20.
>>
>> Statement (a) is probably more useful, but I couldn't find anything to
>> rule that statement (b) is false.
>>
>
> I think the correct answer is probably:
>
> (c) The ill-formed three code unit Unicode 16-bit string
> <0xDC00, 0xD800, 0xDC20> contains one code point, U+10020, and
> one uninterpreted (and uninterpretable) low surrogate
> code unit, 0xDC00.
>
> In other words, I don't think it is useful or helpful to map isolated,
> uninterpretable surrogate code units *to* surrogate code points.
> Surrogate code points are an artifact of the code architecture. They
> are code points in the code space which *cannot* be represented
> in UTF-16, by definition.
>
> Any discussion about properties for surrogate code points is a
> matter of designing graceful API fallback for instances which
> have to deal with ill-formed strings and do *something*. I don't
> think that should extend to treating isolated surrogate code
> units as having interpretable status, *as if* they were valid
> code points represented in the string.
>
> It might be easier to get a handle on this if folks were to ask, instead,
> how many code points are in the ill-formed Unicode 8-bit
> string <0x61, 0xF4, 0x90, 0x90, 0x90, 0x61>. Six code units, yes,
> but how many code points? I'd say two code points and
> four uninterpretable, ill-formed UTF-8 code units, rather than
> any other possible answer.
>
> Basically, you get the same kind of answer if the ill-formed string
> were, instead, <0x61, 0xED, 0xA0, 0x80, 0x61>. Two code points
> and three uninterpretable, ill-formed UTF-8 code units. That is a
> better answer than trying to map 0xED 0xA0 0x80 to U+D800
> and then saying, oh, that is a surrogate code *point*.
>
> --Ken
>
>
>
Received on Mon Oct 12 2015 - 07:44:58 CDT
