Re: Counting characters or bytes in UTF-8?

From: addison@inter-locale.com
Date: Tue Sep 12 2000 - 05:08:40 EDT


Hi Lars,

I'm going to skim the surface in this message. Hopefully this helps: I
think, based only on your question, that you may need to think about your
design a little more. Obviously, more details would help.

1. One UTF-16 code unit does not equal one character. With the advent of
Unicode 3.1 you will need to support surrogates (two code units per
supplementary character). And you should think about whether it is
appropriate to divide a grapheme (a sequence of Unicode characters that
forms one glyph on the screen--think combining marks). Most Unicode
function libraries *do* divide graphemes, so you will need a character
properties class to help the caller make good decisions about string
division. So: strncpy() does not copy exactly "n" Unicode characters in
UTF-16; it copies "n" 16-bit code units.

2. The original intent of strncpy() was to provide a means of copying both
bytes and characters. Since the assumption was 1 byte == 1 char, there was
no problem with this. In addition to the problem in #1, though, UTF-8
introduces these issues:

a) If the caller wishes to copy a certain number of characters, your code
will have to be UTF-8 aware and count characters.

b) If the caller wishes to copy a certain number of bytes (say there is a
limited-length buffer), then your function should copy the maximum number
of whole characters that will fit into the "n" bytes provided. So if the
string contains the character U+FF10 (three bytes to encode in UTF-8) and
the buffer provides only two bytes of storage, an empty string should be
returned.

c) You need to provide a function to discover how many bytes of storage
are required to return 'x' characters, so the caller can allocate
storage. C is particularly unforgiving in this regard. (A sketch of both
(b) and (c) follows this list.)
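
Here is a rough sketch of (b) and (c). All three function names
(utf8_seq_len, utf8_strncpy, utf8_bytes_for_chars) are made up for
illustration, and the code assumes well-formed UTF-8 input:

    #include <stddef.h>
    #include <string.h>

    /* Length in bytes of the UTF-8 sequence whose lead byte is b.
       Stray or invalid lead bytes are treated as single bytes. */
    static size_t utf8_seq_len(unsigned char b)
    {
        if (b < 0x80)           return 1;  /* ASCII */
        if ((b & 0xE0) == 0xC0) return 2;
        if ((b & 0xF0) == 0xE0) return 3;
        if ((b & 0xF8) == 0xF0) return 4;
        return 1;
    }

    /* Copy whole UTF-8 characters from src to dst, using at most n
       bytes of character data; a character that does not fit entirely
       is not copied, so a three-byte character with n == 2 yields an
       empty string. dst must have room for n + 1 bytes; the result
       is always NUL-terminated. */
    char *utf8_strncpy(char *dst, const char *src, size_t n)
    {
        size_t used = 0;
        while (*src) {
            size_t len = utf8_seq_len((unsigned char)*src);
            if (used + len > n)
                break;
            memcpy(dst + used, src, len);
            used += len;
            src  += len;
        }
        dst[used] = '\0';
        return dst;
    }

    /* Bytes of storage required (excluding the NUL) to hold the first
       nchars characters of the UTF-8 string s. */
    size_t utf8_bytes_for_chars(const char *s, size_t nchars)
    {
        size_t bytes = 0;
        while (nchars-- > 0 && s[bytes])
            bytes += utf8_seq_len((unsigned char)s[bytes]);
        return bytes;
    }

Note that utf8_strncpy deliberately differs from the real strncpy(): it
always NUL-terminates and never truncates in the middle of a character,
which is the behavior described in (b).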

You will also want to rethink strlen (does it count bytes or
characters?)... you probably want two functions here. The same goes for
strcmp, is*, and so on. The C library is *not* a good model for a Unicode
string library (even though every programmer knows how to use its
functions...): the problems illustrated above show how sloppy analogies
get programmers who are novices from a Unicode perspective into a lot of
trouble.
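
For example, a minimal sketch of the two-function approach to strlen
(again, the names are hypothetical). The byte count is just strlen();
the character count skips UTF-8 continuation bytes, which always have
the bit pattern 10xxxxxx:

    #include <stddef.h>
    #include <string.h>

    /* Storage length: bytes, excluding the NUL terminator. */
    size_t utf8_strlen_bytes(const char *s)
    {
        return strlen(s);
    }

    /* Character length: count only bytes that begin a character,
       i.e. skip continuation bytes (top two bits == 10). */
    size_t utf8_strlen_chars(const char *s)
    {
        size_t n = 0;
        for (; *s; s++)
            if (((unsigned char)*s & 0xC0) != 0x80)
                n++;
        return n;
    }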

Take a look at the ICU library
(http://oss.software.ibm.com/developerworks/icu) for some further
pointers.

Best Regards and Good Luck,

Addison

===========================================================
Addison P. Phillips Principal Consultant
Inter-Locale LLC http://www.inter-locale.com
Los Gatos, CA, USA mailto:addison@inter-locale.com

+1 408.210.3569 (mobile) +1 408.904.4762 (fax)
===========================================================
Globalization Engineering & Consulting Services

On Mon, 11 Sep 2000, Lars Marius Garshol wrote:

>
> I'm working on a C/C++ application that runs on many different
> platforms and supports Unicode, mostly using the old C string library
> functions. This application can be compiled to either support Unicode
> internally using UTF-16 or to not support Unicode at all. However, for
> some platforms it seems that we may want to compile it to use UTF-8
> internally.
>
> We have a uni_strncpy function name that maps to some function
> performing the same task as the standard strncpy; the mapping differs
> depending on platform and internal text encoding.
>
> The question is what the 'n' argument counts. In 16-bit mode it is
> obviously characters and in non-Unicode mode there is no distinction
> between bytes and characters. However, what do we count with UTF-8?
> My intuition tells me that it will be bytes, since the function will
> not be aware that it is processing UTF-8 at all.
>
> Can someone confirm or deny this?
>
> --Lars M.
>
>


