Why Work at Encoding Level?

From: Richard Wordingham <richard.wordingham_at_ntlworld.com>
Date: Tue, 13 Oct 2015 23:37:06 +0100

On Tue, 13 Oct 2015 16:09:16 +0100
Daniel Bünzli <daniel.buenzli_at_erratique.ch> wrote (under topic heading
'Counting Codepoints'):

> I don't understand why people still insist on programming with
> Unicode at the encoding level rather than at the scalar value level.
> Deal with encoding errors and sanitize your inputs at the IO boundary
> of your program and then simply work with scalar values internally.

If you are referring to indexing, I suspect the issue is performance.
UTF-32 feels wasteful, and if the underlying text is stored as UTF-8 or
UTF-16 we need an auxiliary array mapping codepoint number to byte
offset if we are to have O(1) access time.

This auxiliary array can be compressed chunk by chunk, but the larger
the chunk, the greater the worst-case access time. The scheme is a bit
strange, because the auxiliary array is redundant: everything it stores
can be recomputed from the text itself. For example, you could record
the byte offset of every 4th or every 5th codepoint, storing only the
excess over the minimum so that it fits in 4 bits (five UTF-8
codepoints occupy between 5 and 20 bytes, so the excess is at most 15),
or of every 15th codepoint for UTF-16. Access could proceed by looking
up the absolute offset for the relevant chunk, then summing nibbles to
find the nearest recorded location within the chunk, and finally
stepping through the basic character storage itself to reach the
intermediate codepoints.

(I doubt this is an original idea, but I couldn't find it expressed
anywhere. It probably performs horribly for short strings.)
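To make the idea concrete, here is a minimal Python sketch of one
possible realisation (my own naming and parameter choices, not a
worked-out design): every 4th codepoint's byte offset is stored as a
4-bit excess over the 4-byte minimum, chunks of 16 samples start from
an absolute offset, and lookup sums nibbles and then walks the raw
UTF-8 bytes for the last few codepoints.

```python
SAMPLE = 4   # record every 4th codepoint; 4 UTF-8 codepoints span
             # 4..16 bytes, so the excess over 4 is 0..12: one nibble
CHUNK = 16   # samples per chunk before storing a fresh absolute offset

def _advance(b):
    """Bytes occupied by the UTF-8 codepoint whose lead byte is b."""
    return 1 if b < 0x80 else 2 if b < 0xE0 else 3 if b < 0xF0 else 4

def build_index(data):
    """Return (absolute chunk offsets, per-chunk nibble deltas)."""
    offsets, pos = [], 0
    while pos < len(data):
        offsets.append(pos)
        pos += _advance(data[pos])
    chunk_offsets, nibbles = [], []
    for s in range(0, len(offsets), SAMPLE):
        if (s // SAMPLE) % CHUNK == 0:
            chunk_offsets.append(offsets[s])   # chunk starts absolute
            nibbles.append([])
        else:                                  # excess over the minimum
            nibbles[-1].append(offsets[s] - offsets[s - SAMPLE] - SAMPLE)
    return chunk_offsets, nibbles

def byte_offset(data, chunk_offsets, nibbles, n):
    """Byte offset of codepoint n: chunk lookup + nibble sum + scan."""
    sample, within = divmod(n, SAMPLE)
    chunk, in_chunk = divmod(sample, CHUNK)
    pos = chunk_offsets[chunk] + sum(SAMPLE + d
                                     for d in nibbles[chunk][:in_chunk])
    for _ in range(within):        # walk the last 0..3 codepoints
        pos += _advance(data[pos])
    return pos
```

As predicted above, for short strings the two-level lookup costs more
than simply scanning the bytes from the start.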

Perhaps you are merely suggesting that people work with a character
iterator, or, in C, refrain from doing integer arithmetic on pointers
into strings.
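For comparison, Bünzli's suggestion as I read it amounts to something
like this (a sketch with made-up sample data, using Python, where a
string is already a sequence of scalar values):

```python
# Sanitize at the IO boundary: decode once, replacing malformed
# sequences, so everything downstream sees only Unicode scalar values.
raw = b"caf\xc3\xa9 \xff ok"                  # bytes with one bad UTF-8 byte
text = raw.decode("utf-8", errors="replace")  # errors handled here, once
scalars = [ord(c) for c in text]              # scalar values; no byte offsets
```

After the boundary, encoding errors simply cannot occur, at the cost of
having committed to some in-memory representation.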

Received on Tue Oct 13 2015 - 17:38:38 CDT
