Re: Counting characters or bytes in UTF-8?

From: Mark Davis (markdavis@ispchannel.com)
Date: Mon Sep 11 2000 - 22:40:05 EDT


In general, we've found it far better to have low-level routines always
take their parameters in terms of the code units they implement (e.g.
bytes in this case), and to add higher-level routines that provide other
interesting boundary information (e.g. code point boundaries, grapheme
boundaries, word boundaries, etc.) in terms of those code unit counts.
Most of the time you want to use the code unit interfaces; they give
better performance and predictability. For the few times when you need
other boundaries, you call the higher-level routines to convert between
the two "spaces".

So if your basic strings use byte pointers and byte arrays, then your
string handling routines should also use byte counts and offsets. If you
are interested in other boundaries within those strings, you can add
matching routines that convert back and forth between the boundaries that
you want and the byte locations, such as the following pseudocode APIs:

int byte2graphemeOffset(string s, int byteOffset);     /* returns grapheme offset */
int grapheme2byteOffset(string s, int graphemeOffset); /* returns byte offset */
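
To make this concrete, here is a rough C sketch of what those two
routines could look like. It assumes UTF-8 and makes the simplifying
assumption that every code point is its own grapheme (real grapheme
segmentation per UAX #29 needs property tables, but the shape of the
API is the same); the isBoundary helper is invented for illustration:

/* A byte starts a new code point in UTF-8 unless it is a
   continuation byte, i.e. unless its top two bits are 10. */
static int isBoundary(unsigned char b) {
    return (b & 0xC0) != 0x80;
}

/* Count the boundaries in s[0..byteOffset). */
int byte2graphemeOffset(const char *s, int byteOffset) {
    int graphemes = 0;
    for (int i = 0; i < byteOffset; i++) {
        if (isBoundary((unsigned char)s[i]))
            graphemes++;
    }
    return graphemes;
}

/* Walk forward to the start of grapheme number graphemeOffset
   (counting from zero); an offset one past the last grapheme
   maps to the end of the string. */
int grapheme2byteOffset(const char *s, int graphemeOffset) {
    int i = 0, graphemes = 0;
    while (s[i] != '\0') {
        if (isBoundary((unsigned char)s[i])) {
            if (graphemes == graphemeOffset)
                return i;
            graphemes++;
        }
        i++;
    }
    return i;
}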

So to copy graphemes 3 through 6 (counting from zero) from one string to
another, you would write something like the following:

int start = grapheme2byteOffset(source, 3);
int end = grapheme2byteOffset(source, 7);
copy(source, start, end, target);
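
Once the boundaries have been turned into byte offsets, the copy itself
is an ordinary byte-level operation. The copy routine here is also
pseudocode; a minimal version, assuming target has room for
end - start bytes plus a terminator, is just:

#include <string.h>

/* Copy source[start..end) into target and NUL-terminate it. */
void copy(const char *source, int start, int end, char *target) {
    memcpy(target, source + start, (size_t)(end - start));
    target[end - start] = '\0';
}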

The same technique works whether the boundaries you are interested in
are code point boundaries, UTF-16 code unit boundaries, word boundaries,
etc. Of course, you can always produce extra APIs that are shorthand for
looking up the boundaries and then doing something (e.g. copy), but that
means:
a. duplicating whole rafts of APIs,
b. getting really complicated when you do that for all the different
boundaries that people might be interested in, and
c. having users get muddled about which APIs take which types of
boundaries.

I would only recommend having these shortcuts if performance analysis
showed that they were a clear win.
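
For concreteness, such a shorthand would just bundle the two lookups
with the byte-level copy; a hypothetical graphemeCopy built on the
sketches above adds convenience but no new capability:

/* Hypothetical shorthand: copy graphemes [from, to), counting
   from zero, out of source into target. */
void graphemeCopy(const char *source, int from, int to, char *target) {
    int start = grapheme2byteOffset(source, from);
    int end = grapheme2byteOffset(source, to);
    copy(source, start, end, target);
}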

Mark

Lars Marius Garshol wrote:

> I'm working on a C/C++ application that runs on many different
> platforms and supports Unicode, mostly using the old C string library
> functions. This application can be compiled to either support Unicode
> internally using UTF-16 or to not support Unicode at all. However, for
> some platforms it seems that we may want to compile it to use UTF-8
> internally.
>
> We have a function name, uni_strncpy, that is mapped to some function
> performing the same task as the standard strncpy; the mapping differs
> depending on platform and internal text encoding.
>
> The question is what the 'n' argument counts. In 16-bit mode it is
> obviously characters and in non-Unicode mode there is no distinction
> between bytes and characters. However, what do we count with UTF-8?
> My intuition tells me that it will be bytes, since the function will
> not be aware that it is processing UTF-8 at all.
>
> Can someone confirm or deny this?
>
> --Lars M.
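
To picture the setup Lars describes, the mapping might look like the
following sketch; the macro names here are invented for illustration,
and in the UTF-8 build the 'n' argument counts bytes, as confirmed
above:

#include <string.h>
#include <wchar.h>

/* Hypothetical build-time mapping for uni_strncpy. */
#if defined(UNI_UTF16)
  /* n counts 16-bit code units (on platforms where wchar_t is
     16 bits wide). */
  #define uni_strncpy(dst, src, n) wcsncpy((dst), (src), (n))
#elif defined(UNI_UTF8)
  /* n counts bytes; the function is unaware it handles UTF-8. */
  #define uni_strncpy(dst, src, n) strncpy((dst), (src), (n))
#else
  /* Non-Unicode build: bytes and characters coincide. */
  #define uni_strncpy(dst, src, n) strncpy((dst), (src), (n))
#endif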


