(M) best way to handle Unicode data internally?

From: Marco Mussini (marco.mussini@vim.tlt.alcatel.it)
Date: Mon Aug 31 1998 - 04:41:06 EDT


I would like to hear your opinion about the following point:

Is it better to use UTF-8 or UCS-2 (16-bit Unicode) for internal string
data processing in an application?

If the external world speaks UTF-8 (browsers, for instance), there is
clearly a need to convert back and forth between UTF-8 and UCS-2
whenever that frontier is crossed. So it would seem that using UTF-8
internally as well saves conversion time.
On the other hand, string processing in UTF-8, a variable-length
encoding, is more computationally expensive for some operations:
scanning through a loop one character at a time requires asking a
routine how long the representation of the next character is, because
its length isn't fixed as it is in 16-bit UCS-2.
So there is a tradeoff to evaluate between the number of frontier
crossings and the amount of string processing the program does.
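To make the scanning cost concrete, here is a minimal sketch in Java (the class and method names are my own, and the code assumes well-formed UTF-8, i.e. the index always lands on a lead byte):

```java
public class Utf8Scan {
    // Byte length of the UTF-8 sequence that starts with lead byte b
    // (assumes b really is a lead byte, not a continuation byte).
    static int seqLen(byte b) {
        int u = b & 0xFF;
        if (u < 0x80) return 1; // 0xxxxxxx: 7-bit ASCII
        if (u < 0xE0) return 2; // 110xxxxx: U+0080..U+07FF
        if (u < 0xF0) return 3; // 1110xxxx: U+0800..U+FFFF
        return 4;               // 11110xxx: beyond the BMP
    }

    // Even just counting characters means walking the whole array:
    // every step must first ask how long the current character is.
    static int countChars(byte[] utf8) {
        int count = 0;
        for (int i = 0; i < utf8.length; i += seqLen(utf8[i])) count++;
        return count;
    }

    public static void main(String[] args) {
        byte[] data = "Übergröße"
            .getBytes(java.nio.charset.StandardCharsets.UTF_8);
        System.out.println(countChars(data)); // 9 characters, 12 bytes
    }
}
```

With a fixed 16-bit encoding the loop would be plain index arithmetic (i += 2), and the n-th character could be reached without scanning at all.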

Dimensioning the internal buffers is another point.

Clearly, if Unicode is used only to represent 7-bit ASCII data, UTF-8
saves 50% of the space compared to UCS-2. But I don't think Unicode
was invented, or should be used, to do this. If you use the extended
characters, then with UTF-8 you use up at least 16 bits every time you
use something not included in 7-bit ASCII. And since UTF-8 spends some
bits on the encoding itself, in some (most?) cases you'll end up using
more than 2 bytes per character. In an application that deals mostly
with Chinese/Japanese text, this means wasted space.
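A quick sketch of that size comparison, using Java's standard charset converters (the sample strings are my own):

```java
import java.nio.charset.StandardCharsets;

public class SizeCompare {
    public static void main(String[] args) {
        String ascii = "hello";  // 7-bit ASCII only
        String cjk   = "日本語";  // three CJK ideographs

        // UTF-8: 1 byte per ASCII character, but 3 bytes per BMP ideograph.
        System.out.println(ascii.getBytes(StandardCharsets.UTF_8).length); // 5
        System.out.println(cjk.getBytes(StandardCharsets.UTF_8).length);   // 9

        // Fixed 16-bit form (UTF-16BE, no byte-order mark): 2 bytes each.
        System.out.println(ascii.getBytes(StandardCharsets.UTF_16BE).length); // 10
        System.out.println(cjk.getBytes(StandardCharsets.UTF_16BE).length);   // 6
    }
}
```

So UTF-8 wins by 2x on pure ASCII and loses by 1.5x on CJK text, which is exactly the tradeoff described above.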

Moreover, if the worst-case length of a character in UTF-8 is, say,
4 bytes, then to be sure your buffer holds 10 characters you have to
dimension it at 40 bytes. This is twice as much as UCS-2, where 20
bytes would always be enough. The same problem applies to sizing
database table columns.
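The buffer arithmetic as a sketch (taking 4 bytes as the UTF-8 worst case, as in the paragraph above; the method names are hypothetical):

```java
public class BufferSizing {
    // Bytes needed to guarantee room for n characters in each encoding.
    static int utf8WorstCase(int n) { return n * 4; } // assumed 4-byte maximum
    static int ucs2Size(int n)      { return n * 2; } // fixed 16-bit cells

    public static void main(String[] args) {
        System.out.println(utf8WorstCase(10)); // 40 bytes to be safe
        System.out.println(ucs2Size(10));      // 20 bytes, always enough
    }
}
```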

I think UTF-8 can waste buffer space and perhaps increase computation
time compared to UCS-2, as far as internal use is concerned.

Unfortunately, most OSs that do offer support today for some flavour of
Unicode in their APIs offer it as UTF-8 rather than UCS-2.

Java works internally in UCS-2 (ideally), but it is likely to
communicate with the external world in UTF-8.
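In Java terms the frontier conversion looks roughly like this (using the current standard-library charset API; the String itself is stored as 16-bit char values):

```java
import java.nio.charset.StandardCharsets;

public class Frontier {
    public static void main(String[] args) {
        String internal = "caffè";  // held internally as 16-bit chars

        // Outbound: encode to UTF-8 at the boundary.
        byte[] wire = internal.getBytes(StandardCharsets.UTF_8);

        // Inbound: decode UTF-8 back into the internal 16-bit form.
        String decoded = new String(wire, StandardCharsets.UTF_8);

        System.out.println(wire.length);              // 6 (è takes 2 bytes)
        System.out.println(decoded.equals(internal)); // true
    }
}
```

Every such round trip is one of the "frontier crossings" weighed against internal processing cost earlier in this message.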

Is it reasonable to use UCS-2 for external communications? Do you know
if browsers will support it, or whether they will stick with UTF-7 and

I would be glad to hear your opinion on this.


This archive was generated by hypermail 2.1.2 : Tue Jul 10 2001 - 17:20:41 EDT