Re: string vs. char [was Re: Java and Unicode]

From: addison@inter-locale.com
Date: Mon Nov 20 2000 - 12:38:45 EST


Hi Jani,

I dunno. I oversimplified in that statement about exposing vs. hiding.

ICU "hides" the facts about the Unicode implementation in macros,
specifically a next and previous character macro and various other
fillips. If you look very closely at the function (method) prototypes you
can see that, in fact, a "character" is a 32-bit entity and a string is
made (conditionally) of 16-bit entities. But, as you suggest, ICU makes it
easy to work with (and is set up so that a sufficiently motivated coder
could change the internal encoding).

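To make the "next and previous character" idea concrete, here is a minimal sketch of what such helpers do over 16-bit code units. It is an illustration only, not ICU's actual macros: the names nextCodePoint and previousCodePoint are made up for this note, and unpaired surrogates are simply returned as-is.

```cpp
#include <cstddef>
#include <cstdint>

// Hypothetical helper (not an ICU API): decode the code point that starts
// at s[i] and advance i past it. Precondition: i < length.
inline std::uint32_t nextCodePoint(const std::uint16_t* s, std::size_t& i,
                                   std::size_t length)
{
    std::uint32_t c = s[i++];
    // A lead surrogate followed by a trail surrogate is one code point
    // above U+FFFF, stored in two 16-bit code units.
    if (c >= 0xD800 && c <= 0xDBFF && i < length) {
        std::uint32_t trail = s[i];
        if (trail >= 0xDC00 && trail <= 0xDFFF) {
            ++i;
            c = 0x10000 + ((c - 0xD800) << 10) + (trail - 0xDC00);
        }
    }
    return c;
}

// Hypothetical helper (not an ICU API): move i back to the start of the
// preceding code point and return it. Precondition: i > 0.
inline std::uint32_t previousCodePoint(const std::uint16_t* s, std::size_t& i)
{
    std::uint32_t c = s[--i];
    if (c >= 0xDC00 && c <= 0xDFFF && i > 0) {
        std::uint32_t lead = s[i - 1];
        if (lead >= 0xD800 && lead <= 0xDBFF) {
            --i;
            c = 0x10000 + ((lead - 0xD800) << 10) + (c - 0xDC00);
        }
    }
    return c;
}
```

The point of the sketch is the asymmetry in the types: the value returned is 32 bits wide, while the index i counts 16-bit code units and may move by one or two per call.
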
<rant>
If you ask 100 programmers what an index into a string refers to, they'll
give you the wrong answer 99 times... because there is little or no I18n
training in the course of becoming a programmer. The members of this list are
continually ground down by the sheer inertia of ignorance (I just gave up
answering one about email... I must have written a response to that
message a bunch of times, but don't have the time or stamina this morning
to go find and rework one of them).
</rant>

In any case this has been a fun and instructive interlude. As I said in
my initial email, I tend to be a CONSUMER of Unicode APIs rather than a
creator. I haven't written a Unicode support package in quite some time
(and the last one was a UTF-8 hack in C++). It's good to be familiar with
the details, but I find that, as a programmer, one typically doesn't fully
comprehend the design decisions until one faces them oneself. As it is, I
ended up changing my design and sample code over the weekend to follow the
suggestions of several on this list who've Been There.

As a side note: one of the problems I faced on this project was the need
to keep the Unicode and locale libraries extremely small (this is an
embedded OS). I would happily have borrowed ICU to actually *be* the
library... but it's too large. I've had to design a tiny (and therefore
quite limited) support library. It's been an interesting experience.

Best Regards,

Addison

===========================================================
Addison P. Phillips Principal Consultant
Inter-Locale LLC http://www.inter-locale.com
Los Gatos, CA, USA mailto:addison@inter-locale.com

+1 408.210.3569 (mobile) +1 408.904.4762 (fax)
===========================================================
Globalization Engineering & Consulting Services

On Mon, 20 Nov 2000, Jani Kajala wrote:

>
> > >The question, I guess, boils down to: put it in the interface, or hide it
> > >in the internals. ICU exposes it. My spec, up to this point, hides it,
>
> (I'm aware that the original question was about C interfaces, so you might consider this a bit off topic, but I just wanted to comment on the exposed encoding.)
>
> I think that exposing the encoding in interfaces doesn't do any good. It violates object-oriented design principles and it is not even intuitive.
>
> I'd bet that if we take 100 programmers and ask them 'What is this index in the context of this string?', in every case we'll get the answer that it's of course the nth character position. Nobody who isn't well aware of character encodings will ever think of code units. Thus, it is not intuitive to use indices to point at code units, especially as Unicode has been so well marketed as a '16-bit character set'.
>
> Besides, you can always use (C++ style) iterators instead of plain indices without any loss in performance or syntactic convenience. By 'iterator' here I mean a simple encapsulated pointer which behaves just like any C++ Standard Template Library random access iterator but takes the encoding into account. Example:
>
> for ( String::Iterator i = s.begin() ; i != s.end() ; ++i )
> {
>     // ith character in s = *i
>     // (i+n)th character in s = i[n]
> }
>
> The solution works with any encoding as long as String::Iterator is defined properly.
>
> The conclusion that using indices won't make a difference in performance also makes sense if you consider the underlying task: if you need random access to a string, you need to check for characters spanning multiple code units. So the task has the same O(n) complexity either way; using indices won't help a bit. If the user needs access to an arbitrary character, he needs to iterate anyway. It is just a matter of how you want to encapsulate the task.
>
>
> Regards,
> Jani Kajala
> http://www.helsinki.fi/~kajala/
>
>

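For readers who want to see the iterator idea from the quoted message spelled out, here is one possible sketch over UTF-16. It is an assumption-laden illustration, not Jani's actual class: CodePointIterator and countCodePoints are hypothetical names, the input is assumed to be well-formed UTF-16, and operator[] is deliberately implemented by stepping, which is the O(n) cost discussed above.

```cpp
#include <cstddef>
#include <string>

// Hypothetical sketch (not from any library): walks a UTF-16 buffer one
// code point at a time, hiding the one-or-two code unit detail.
class CodePointIterator
{
public:
    CodePointIterator(const char16_t* pos, const char16_t* end)
        : pos_(pos), end_(end) {}

    // Decode the code point at the current position.
    char32_t operator*() const
    {
        char32_t c = pos_[0];
        if (isPair())
            c = 0x10000 + ((c - 0xD800) << 10) + (pos_[1] - 0xDC00);
        return c;
    }

    // Advance by one code point, i.e. by one or two code units.
    CodePointIterator& operator++()
    {
        pos_ += isPair() ? 2 : 1;
        return *this;
    }

    // The (i+n)th code point. The syntax looks like random access, but the
    // implementation walks n steps, so the cost is O(n).
    // Precondition: at least n+1 code points remain.
    char32_t operator[](std::size_t n) const
    {
        CodePointIterator it = *this;
        while (n-- > 0)
            ++it;
        return *it;
    }

    bool operator==(const CodePointIterator& o) const { return pos_ == o.pos_; }
    bool operator!=(const CodePointIterator& o) const { return pos_ != o.pos_; }

private:
    // True if the current position holds a lead surrogate followed by a
    // trail surrogate.
    bool isPair() const
    {
        return pos_[0] >= 0xD800 && pos_[0] <= 0xDBFF
            && pos_ + 1 < end_
            && pos_[1] >= 0xDC00 && pos_[1] <= 0xDFFF;
    }

    const char16_t* pos_;
    const char16_t* end_;
};

// Count code points (not code units) by iterating from begin to end.
inline std::size_t countCodePoints(const std::u16string& s)
{
    const char16_t* first = s.data();
    const char16_t* last = s.data() + s.size();
    CodePointIterator it(first, last);
    CodePointIterator end(last, last);
    std::size_t n = 0;
    for (; it != end; ++it)
        ++n;
    return n;
}
```

With a string holding 'A', the surrogate pair for U+1D11E, and 'B', the buffer has four code units but the loop visits three code points, which is the "nth character" view the quoted message argues for.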

