Re: Why Work at Encoding Level? from Daniel Bünzli on 2015-10-13 (Unicode Mail List Archive)

From: Daniel Bünzli <daniel.buenzli_at_erratique.ch>
Date: Wed, 14 Oct 2015 00:28:26 +0100

On Tuesday, 13 October 2015 at 23:37, Richard Wordingham wrote:
> If you are referring to indexing, I suspect the issue is performance.
> UTF-32 feels wasteful, and if the underlying character text is UTF-8 or
> UTF-16 we need an auxiliary array to convert character number to byte
> offset if we are to have O(1) time for access.

If UTF-32 feels wasteful, there are various smart ways of providing direct indexing at a reasonable cost, provided your language has at least minimal support for datatype definition and abstraction.
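
As an illustration only, here is a minimal Python sketch of one such approach: keep the text as UTF-8 and store a byte offset for every `stride`-th character, so that a lookup jumps to a checkpoint and walks forward a bounded number of characters. The class name `IndexedUtf8` and the `stride` parameter are hypothetical, chosen for this example; this is one possible design, not necessarily the one intended above.

```python
class IndexedUtf8:
    """UTF-8 text plus a sparse table from character number to byte offset."""

    def __init__(self, text: str, stride: int = 32):
        self.data = text.encode("utf-8")
        self.stride = stride
        # Byte offsets of characters 0, stride, 2 * stride, ...
        self.checkpoints = []
        count = 0
        for offset, byte in enumerate(self.data):
            # Any byte that is not a continuation byte (10xxxxxx)
            # starts a new scalar value.
            if byte & 0xC0 != 0x80:
                if count % stride == 0:
                    self.checkpoints.append(offset)
                count += 1
        self.length = count

    def __len__(self) -> int:
        return self.length

    def _next(self, offset: int) -> int:
        """Byte offset of the character that follows the one at offset."""
        offset += 1
        while offset < len(self.data) and self.data[offset] & 0xC0 == 0x80:
            offset += 1
        return offset

    def __getitem__(self, i: int) -> str:
        if not 0 <= i < self.length:
            raise IndexError(i)
        # Jump to the nearest checkpoint at or before character i,
        # then walk forward at most stride - 1 characters.
        offset = self.checkpoints[i // self.stride]
        for _ in range(i % self.stride):
            offset = self._next(offset)
        return self.data[offset:self._next(offset)].decode("utf-8")
```

With a fixed `stride`, access time is bounded by a constant and the table costs one integer per `stride` characters, which is the trade-off against a full auxiliary array.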

Also, I personally find indexing to be rarely useful in string processing, so it may not be the operation you want to optimize for. Having iterator-like functions, as you suggest, and a datatype to represent substrings often seems a better fit than doing index arithmetic.
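
To make that concrete, here is a hedged Python sketch of a substring datatype with an iterator-like function over a UTF-8 buffer. The names `Sub`, `chars`, `split_once` and `sub` are hypothetical, invented for this example; the point is only that the caller never computes an index.

```python
from dataclasses import dataclass
from typing import Iterator


@dataclass(frozen=True)
class Sub:
    """A view on base[start:stop] of a UTF-8 buffer; no bytes are copied."""
    base: bytes
    start: int
    stop: int

    def chars(self) -> Iterator[str]:
        """Yield the characters (scalar values) of the view, one at a time."""
        pos = self.start
        while pos < self.stop:
            end = pos + 1
            # Skip continuation bytes (10xxxxxx) to reach the next character.
            while end < self.stop and self.base[end] & 0xC0 == 0x80:
                end += 1
            yield self.base[pos:end].decode("utf-8")
            pos = end

    def split_once(self, sep: str):
        """Return (before, after) around the first sep, or None if absent."""
        needle = sep.encode("utf-8")
        at = self.base.find(needle, self.start, self.stop)
        if at < 0:
            return None
        before = Sub(self.base, self.start, at)
        after = Sub(self.base, at + len(needle), self.stop)
        return before, after

    def __str__(self) -> str:
        return self.base[self.start:self.stop].decode("utf-8")


def sub(text: str) -> Sub:
    """Build a view covering the whole of text."""
    data = text.encode("utf-8")
    return Sub(data, 0, len(data))
```

For example, `sub("clé=valeur").split_once("=")` gives two views on the same buffer, and `chars()` walks either of them without exposing byte offsets.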

Note that the Swift programming language seems to have gone even further than I would have: its notion of character is a grapheme cluster, compared for equality using canonical equivalence, and that is what its strings index, see [1]. I don't know how well that works in practice, as I have never used it myself, but it feels like the ultimate Unicode string model to offer the programmer with zero Unicode knowledge (at least for alphabetic scripts).
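
The equality part of that model can be approximated in Python with the standard `unicodedata` module, as sketched below. This is not Swift's implementation, and it does not segment grapheme clusters (the Python standard library has no such function); it only shows what comparing under canonical equivalence means. The helper name `canonically_equal` is hypothetical.

```python
import unicodedata


def canonically_equal(a: str, b: str) -> bool:
    """Compare two strings under canonical equivalence, via NFD."""
    return unicodedata.normalize("NFD", a) == unicodedata.normalize("NFD", b)


precomposed = "\u00e9"   # LATIN SMALL LETTER E WITH ACUTE, one scalar value
decomposed = "e\u0301"   # "e" followed by COMBINING ACUTE ACCENT, two scalar values

# Scalar-value comparison sees two different strings of different lengths.
print(precomposed == decomposed)                    # False
print(len(precomposed), len(decomposed))            # 1 2

# Under canonical equivalence they are the same text.
print(canonically_equal(precomposed, decomposed))   # True
```

A model that indexes grapheme clusters would, in addition, treat both spellings as a single character.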

Best,

Daniel

[1] https://developer.apple.com/library/ios/documentation/Swift/Conceptual/Swift_Programming_Language/StringsAndCharacters.html
Received on Tue Oct 13 2015 - 18:30:15 CDT
