Re: Counting characters or bytes in UTF-8?

From: Michael (michka) Kaplan (michka@trigeminal.com)
Date: Mon Sep 11 2000 - 11:40:32 EDT

Next message: addison@inter-locale.com: "Re: Counting characters or bytes in UTF-8?"
Previous message: Michael (michka) Kaplan: "Re: Tamil glyphs"
Maybe in reply to: Lars Marius Garshol: "Counting characters or bytes in UTF-8?"
Next in thread: addison@inter-locale.com: "Re: Counting characters or bytes in UTF-8?"

Actually, characters are never *required* to equal bytes, even in non-Unicode
situations. Take DBCS, for example. UTF-8 simply becomes another multibyte
encoding.
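
To make the multibyte point concrete, here is a rough sketch in C. It assumes
the input is valid UTF-8, and the helper name utf8_safe_prefix is made up for
illustration (it is not part of any standard library). A byte-counted copy
such as strncpy has no idea where characters begin, so a byte limit can land
in the middle of one; this helper backs up to the nearest boundary instead.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: returns the largest prefix length, in bytes, that is
   <= max_bytes and does not split a UTF-8 sequence.
   Assumes s is a NUL-terminated, valid UTF-8 string. */
size_t utf8_safe_prefix(const char *s, size_t max_bytes)
{
    size_t len = strlen(s);
    size_t n;

    if (len <= max_bytes)
        return len;               /* the whole string fits */

    n = max_bytes;
    /* Continuation bytes have the form 10xxxxxx. If the cut point sits on
       one, back up until it sits on the start of a character. */
    while (n > 0 && ((unsigned char)s[n] & 0xC0) == 0x80)
        n--;
    return n;
}
```

A caller could then do something like memcpy(dst, src, n) followed by
dst[n] = '\0', where n came from utf8_safe_prefix(src, sizeof dst - 1).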

It is *easier*, and sometimes required for things like buffer sizes, to pass a
count of bytes. But there are also times when you care about the count of
characters, such as in rendering and in validating size limits.
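
As a rough illustration of the two counts, the sketch below compares strlen
(bytes) with a character count for a UTF-8 string. It assumes valid UTF-8 and
takes "character" to mean code point; the function name utf8_count_chars is
invented for this example.

```c
#include <stddef.h>
#include <string.h>

/* Count code points in a NUL-terminated string, assuming valid UTF-8.
   Every byte that is NOT a continuation byte (10xxxxxx) starts a new
   code point, so counting those bytes gives the character count. */
size_t utf8_count_chars(const char *s)
{
    size_t cch = 0;

    for (; *s != '\0'; s++) {
        if (((unsigned char)*s & 0xC0) != 0x80)
            cch++;
    }
    return cch;
}
```

For the string "na\xC3\xAFve" (the word "naive" with a diaeresis on the i),
strlen reports 6 bytes while utf8_count_chars reports 5 characters.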

So the best answer is that it depends on what you are doing with the value.
From a perf standpoint, it might be bad to have to continually recalculate
cch (the character count) of a string if you only ever pass cb (the byte
count). Perhaps it would make sense sometimes to pass both pieces of
information along with the string. It all depends...
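
One possible way to carry both values, sketched in C. The counted_str type
and counted_str_make function are hypothetical names for this example, and
the code assumes valid UTF-8 with "character" meaning code point. Both counts
are computed once, so neither has to be recalculated on every call.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical string view that carries both counts with the pointer. */
typedef struct {
    const char *data;  /* NUL-terminated UTF-8, not owned by this struct */
    size_t cb;         /* count of bytes, excluding the terminating NUL */
    size_t cch;        /* count of characters (code points) */
} counted_str;

/* Compute cb and cch once, assuming s is valid UTF-8. */
counted_str counted_str_make(const char *s)
{
    counted_str cs;
    size_t i;

    cs.data = s;
    cs.cb = strlen(s);
    cs.cch = 0;
    for (i = 0; i < cs.cb; i++) {
        /* Bytes of the form 10xxxxxx continue a character; all others
           start one. */
        if (((unsigned char)s[i] & 0xC0) != 0x80)
            cs.cch++;
    }
    return cs;
}
```

The trade-off is the usual one: the struct costs a little memory and must be
rebuilt whenever the string changes, in exchange for not rescanning it.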

michka

a new book on internationalization in VB at
http://www.i18nWithVB.com/
----- Original Message -----
From: "Lars Marius Garshol" <larsga@garshol.priv.no>
To: "Unicode List" <unicode@unicode.org>
Sent: Monday, September 11, 2000 1:23 AM
Subject: Counting characters or bytes in UTF-8?

>
> I'm working on a C/C++ application that runs on many different
> platforms and supports Unicode, mostly using the old C string library
> functions. This application can be compiled to either support Unicode
> internally using UTF-16 or to not support Unicode at all. However, for
> some platforms it seems that we may want to compile it to use UTF-8
> internally.
>
> We have a uni_strncpy function name that is mapped to some function
> that performs the same task as the standard strncpy function and the
> name is mapped differently depending on platform and internal text
> encoding.
>
> The question is what the 'n' argument counts. In 16-bit mode it is
> obviously characters and in non-Unicode mode there is no distinction
> between bytes and characters. However, what do we count with UTF-8?
> My intuition tells me that it will be bytes, since the function will
> not be aware that it is processing UTF-8 at all.
>
> Can someone confirm or deny this?
>
> --Lars M.
>
>


This archive was generated by hypermail 2.1.2 : Tue Jul 10 2001 - 17:21:13 EDT