Re: Counting Codepoints

From: Mark Davis ☕️ <>
Date: Tue, 13 Oct 2015 14:08:28 +0200

On Tue, Oct 13, 2015 at 8:36 AM, Richard Wordingham <> wrote:

> Rather the question must be the unwieldy one of how
> many scalar values and lone surrogates it contains in total.

That may be the question in theory; in practice no programming language is
going to support APIs like that. So the question is whether your original
question was purely theoretical, or was about some particular environment.

If the latter, then looking at the behavior of related functions in that
environment, like traversing a string, and counting in a way that is most
consistent with their behavior, is the least likely to cause problems.

For example, Java is pretty consistent; each of the following returns 2 as
the count.

    // One lone low surrogate followed by one valid surrogate pair.
    String test = "\uDC00\uD800\uDC20";
    int count = test.codePointCount(0, test.length());
    System.out.println("codePointCount:\t" + count);

    count = 0;
    int cp;
    for (int i = 0; i < test.length(); i += Character.charCount(cp)) {
      cp = test.codePointAt(i);
      count++;
    }
    System.out.println("Java 7 iteration:\t" + count);

    count = 0;
    for (int cp2 : test.codePoints().toArray()) {
      count++;
    }
    System.out.println("Java 8 iteration:\t" + count);

    // for the last, could just call: count = (int) test.codePoints().count();
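
To tie that count back to the quoted question, here is a minimal sketch
that splits the same total into scalar values and lone surrogates. The
class and method names (SurrogateTally, tally) are made up for
illustration and are not part of any Java API; only codePointAt,
Character.charCount and the Character surrogate constants are standard.

```java
public class SurrogateTally {
    // Hypothetical helper: returns {scalarValues, loneSurrogates} for s.
    // codePointAt() yields a value in U+D800..U+DFFF only when the code
    // unit at that index is not part of a well-formed surrogate pair.
    static int[] tally(String s) {
        int scalars = 0;
        int lone = 0;
        for (int i = 0; i < s.length(); ) {
            int cp = s.codePointAt(i);
            if (cp >= Character.MIN_SURROGATE && cp <= Character.MAX_SURROGATE) {
                lone++;      // unpaired surrogate code unit
            } else {
                scalars++;   // BMP scalar value or a decoded surrogate pair
            }
            i += Character.charCount(cp);
        }
        return new int[] { scalars, lone };
    }

    public static void main(String[] args) {
        String test = "\uDC00\uD800\uDC20";
        int[] t = tally(test);
        // prints: scalar values: 1, lone surrogates: 1, total: 2
        System.out.println("scalar values: " + t[0]
            + ", lone surrogates: " + t[1]
            + ", total: " + (t[0] + t[1]));
    }
}
```

For the test string above, the total of the two tallies equals what
codePointCount reports, since each lone surrogate is counted as one
code point.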

The isolated surrogate code unit is consistently treated as the
corresponding surrogate code point, which is what anyone would
reasonably expect.

Received on Tue Oct 13 2015 - 07:10:01 CDT
