Re: Counting Codepoints

From: Mark Davis ☕️ <mark_at_macchiato.com>
Date: Tue, 13 Oct 2015 14:08:28 +0200

On Tue, Oct 13, 2015 at 8:36 AM, Richard Wordingham <
richard.wordingham_at_ntlworld.com> wrote:

> Rather the question must be the unwieldy one of how
> many scalar values and lone surrogates it contains in total.
>

That may be the question in theory; in practice no programming language is
going to support APIs like that. So the question is whether your original
question was purely theoretical, or was about some particular
language/environment.

If the latter, then look at how related functions in that environment
behave, such as those for traversing a string, and count in the way that is
most consistent with them; that is the least likely to cause problems.

For example, Java is pretty consistent; each of the following returns 2 as
the count.

    String test = "\uDC00\uD800\uDC20";
    int count = test.codePointCount(0, test.length());
    System.out.println("codePointCount:\t" + count);

    count = 0;
    int cp;
    for (int i = 0; i < test.length(); i += Character.charCount(cp)) {
      cp = test.codePointAt(i);
      count++;
    }
    System.out.println("Java 7 iteration:\t" + count);

    count = 0;
    for (int cp2 : test.codePoints().toArray()) {
      count++;
    }
    System.out.println("Java 8 iteration:\t" + count);

    // for the last, could just call: count = (int) test.codePoints().count();

The isolated surrogate code unit is consistently treated as the
corresponding surrogate code point, which is what anyone would reasonably
expect.
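
As a minimal sketch of that behavior, the following splits the same count
into lone surrogates and scalar values, which is the breakdown raised in the
quoted question. The class and method names (SurrogateCount,
countCodePoints, countLoneSurrogates) are illustrative assumptions, not
anything from the Java library; only the String and Character calls are
standard.

```java
public class SurrogateCount {
    // Counts code points the way String.codePointCount does: a well-formed
    // surrogate pair counts once, and a lone surrogate also counts once.
    static int countCodePoints(String s) {
        return s.codePointCount(0, s.length());
    }

    // Counts the code points that are lone (unpaired) surrogates.
    static int countLoneSurrogates(String s) {
        int lone = 0;
        for (int i = 0; i < s.length(); ) {
            int cp = s.codePointAt(i);
            // codePointAt yields a value in D800..DFFF only when the code
            // unit at i is not part of a well-formed surrogate pair.
            if (cp >= Character.MIN_SURROGATE && cp <= Character.MAX_SURROGATE) {
                lone++;
            }
            i += Character.charCount(cp);
        }
        return lone;
    }

    public static void main(String[] args) {
        // Same test string as above: a lone low surrogate, then a valid pair.
        String test = "\uDC00\uD800\uDC20";
        int total = countCodePoints(test);
        int lone = countLoneSurrogates(test);
        System.out.println("code points:\t" + total);
        System.out.println("lone surrogates:\t" + lone);
        System.out.println("scalar values:\t" + (total - lone));
    }
}
```

For the test string this reports 2 code points, of which 1 is a lone
surrogate and 1 is a scalar value (U+10020).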

Mark
Received on Tue Oct 13 2015 - 07:10:01 CDT
