From: Antoine Leca (Antoine10646@leca-marti.org)
Date: Thu Dec 02 2004 - 06:12:20 CST
On Wednesday, December 01, 2004 22:40Z Theodore H. Smith wrote:
> Assuming you had no legacy code. And no "handy" libraries either,
> except for byte libraries in C (string.h, stdlib.h). Just a C++
> compiler, a "blank page" to draw on, and a requirement to do a lot of
> Unicode text processing.
<...>
> What would be the nicest UTF to use?
There are other factors that might influence your choice.
For example, the relative cost of using 16-bit entities: on a Pentium it is
cheap, on more modern x86 processors the price is a bit higher, and on some
RISC chips it is prohibitive (that is, short may become 32 bits; obviously,
in such a case, UTF-16 is not really a good choice). At the other extreme,
you have processors where bytes are 16 bits; obviously again, UTF-8 is
not optimal there. ;-)
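A small sketch of how one could check, on a given implementation, whether
a 16-bit code unit maps onto short at all. The helper names are made up
for this illustration, and the code only reports storage widths; it says
nothing about the actual speed of 16-bit accesses.

```c
#include <limits.h>

/* Hypothetical helper: storage width of unsigned short, in bits.
   CHAR_BIT is the number of bits in a byte on this implementation
   (8 on most machines, 16 or more on some DSPs). */
static unsigned ushort_storage_bits(void)
{
    return (unsigned)(sizeof(unsigned short) * CHAR_BIT);
}

/* Hypothetical helper: nonzero when unsigned short is exactly 16 bits
   wide, i.e. when a UTF-16 code unit wastes no storage in a short. */
static int utf16_fits_short_exactly(void)
{
    return ushort_storage_bits() == 16;
}
```

On a machine where short is 32 bits the second function returns 0, which
is the case where UTF-16 loses its size advantage.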
It also matters whether you have write access to the sources of your
library: if so, it may be possible (at a minimal adaptation cost) to
make it handle 16-bit or 32-bit characters. Even more interesting, this
might already exist, in the form of the wcs*() functions of the C95 Standard.
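To illustrate, here is a minimal sketch using wcschr(), the wide
counterpart of strchr() from <wchar.h>. The function count_wide() is a
name invented for this example, and the width and encoding of wchar_t
remain implementation-defined.

```c
#include <stddef.h>
#include <wchar.h>

/* Count the occurrences of the wide character c in the wide string s.
   Same loop one would write with strchr() on a byte string. */
static size_t count_wide(const wchar_t *s, wchar_t c)
{
    size_t n = 0;

    if (c == L'\0')
        return 0;               /* the terminator is not counted */
    while ((s = wcschr(s, c)) != NULL) {
        n++;
        s++;                    /* resume just after the match */
    }
    return n;
}
```

For example, count_wide(L"a,b,c", L',') yields 2, exactly as the byte
version would on "a,b,c".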
It also depends, obviously, on the kind of processing you are doing. Some
programs mainly handle strings, so the transformation format is not the most
important thing. Others handle characters, and then UTF-8 is less
adequate because of the cost of relocating. On the other hand, texts are
stored in external files, and if the external format is UTF-8 or based on
it, that may bias the choice toward it.
And finally it may depend on how many different architectures you need to
deploy your programs on. C is great for its portability, yet portability is
a tool, not a necessary target. A single user usually does not care how
portable the program he is using is, provided it does the job and turns out
cheap (or not too expensive). I agree portability is a good point for IT
managers (because it fosters competition, which is good for cutting costs.)
But on the other hand, too much portability can be counter-productive for
everyone (for example, writing a text processor in C which allows characters
to be stored directly as 8-bit as well as UTF-16 bytes. Or using long for
everything, in order to be potentially portable to 16-bit ints, even if the
storage limitations will impede practical use.)
I believe the current availability of 3 competing formats is a fact that
we have to accept. It is certainly not as optimal as the prevalence of ASCII
may have been. It is certainly a bad thing for some suppliers, such as those
who write those libraries, because it means ×3 work for them and an
increased price for their users (be it in sales price or in delayed
availability of features/bug corrections/etc.) Moreover, the present
existence of widely available yet incompatible installed bases for at least
two of the formats (namely UTF-16 on Windows NT and UTF-8 on Internet
protocols) means additional costs for nearly the whole industry. This may
mean more workload for those who are actually working in this area ;-), but
also more pressure on them from their management, and it results in waste
when seen from the client side, so not a good thing for marketing.
Yet it is this way, and I assume there is not much we can do to cure that.
Now let's proceed to read the rest...
> I think UTF8 would be the nicest UTF.
So that is your point of view.
> But does UTF32 offer simpler better faster cleaner code?
Perhaps you can actually try to measure it.
> A Unicode "character" can be decomposed. Meaning that a character
> could still be a few variables of UTF32 code points! You'll still
> need to carry around "strings" of characters, instead of characters.
This syllogism assumes that any text handling requires decomposition. I
disagree with this.
> The fact that it is totally bloat worthy, isn't so great. Bloat
> mongers aren't your friend.
Again, do you care to offer us any figures?
> The fact that it is incompatible with existing byte code doesn't help.
See above.
> UTF8 can be used with the existing byte libraries just fine.
It depends on what you want to do. For example, using strchr()/strspn() and
the like may be great if you are dealing with some sort of tagged formats
such as SGML; but if your text uses U+2028 as end-of-line indicator, it
suddenly becomes not so great...
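A minimal sketch of the difference: U+2028 is encoded in UTF-8 as the
three bytes E2 80 A8, so strchr(), which looks for a single byte, cannot
find it, and one has to fall back on strstr() or hand-written code. The
names UTF8_LS and find_line_end() are invented for this illustration.

```c
#include <string.h>

/* UTF-8 encoding of U+2028 LINE SEPARATOR (three bytes). */
#define UTF8_LS "\xE2\x80\xA8"

/* Return a pointer to the first line end in the UTF-8 string s, where a
   line end is either '\n' or U+2028; return NULL if there is none.
   '\n' is a single byte, so strchr() works for it; U+2028 is not. */
static const char *find_line_end(const char *s)
{
    const char *nl = strchr(s, '\n');
    const char *ls = strstr(s, UTF8_LS);

    if (nl == NULL)
        return ls;
    if (ls == NULL)
        return nl;
    return nl < ls ? nl : ls;
}
```

Note that the string is now scanned twice, which is part of why the byte
libraries stop being "just fine" here.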
> An accented A in UTF-8, would be 3 bytes decomposed.
Or more.
> In UTF32, thats 8 bytes!
And so? Nobody is saying that UTF-32 is space efficient. In fact, UTF-32
specifically trades space for other advantages. If you are tight on space,
then obviously UTF-32 is not an option. That is another constraint, which
you did not add to the list above.
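To put numbers on the figures quoted above, here is a sketch that counts
the bytes a sequence of code points takes in UTF-8. The function names
are invented for this example, and taking U+0041 followed by U+0301
(COMBINING ACUTE ACCENT) as the decomposed accented A is my assumption.

```c
#include <stddef.h>

/* Bytes needed to encode one code point in UTF-8, or 0 if cp is not a
   Unicode scalar value (a surrogate, or above U+10FFFF). */
static size_t utf8_len(unsigned long cp)
{
    if (cp < 0x80)
        return 1;
    if (cp < 0x800)
        return 2;
    if (cp >= 0xD800 && cp <= 0xDFFF)
        return 0;
    if (cp < 0x10000)
        return 3;
    if (cp <= 0x10FFFF)
        return 4;
    return 0;
}

/* Total UTF-8 size of n code points; in UTF-32 the same text always
   takes 4 * n bytes. */
static size_t utf8_total(const unsigned long *cps, size_t n)
{
    size_t i, total = 0;

    for (i = 0; i < n; i++)
        total += utf8_len(cps[i]);
    return total;
}
```

With { 0x41, 0x301 } this gives 3 bytes against 8 in UTF-32; adding a
second combining mark such as U+0323 gives 5 against 12, hence "or more".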
On the other hand, nowadays, the general-purpose workstation used for text
processing has several hundred megabytes of memory. That is, several
scores of megabytes of UTF-32 characters, decomposed and so on.
The biggest text I have at hand is below 15 M. And when I have to deal with
it, I am quite clearly I/O-bound, not memory-bound.
> Also, UTF-8 is a great file format, or socket-transfer format.
You are using sockets xeno-IPC for intensive text processing? Do you really
believe it is representative?
> Not needing to convert is great.
I must be missing something here. You start out living in a perfect
world, with no legacy. Yet you need to convert for external interfacing...
> Its also compatible with C strings.
Specifically not a good point to mention here these days. :-)
> Also, UTF-8 has no endian issues.
And?
If everyone used network order everywhere, there would be no problem
with the other formats either. Alas, while Intel does provide elementary
instructions to deal with that at the lower level, their use is not
practical and even less efficient. And I see nothing like bi-endianness
coming in future processors from Intel.
Also, if you are living purely in a Windows world, neither is endianness an
actual problem as far as I can see.
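For what it is worth, the usual portable way around the issue is to
assemble each code unit from bytes explicitly instead of reading a
16-bit object from memory. A sketch, with function names of my own
choosing:

```c
/* Read one 16-bit code unit stored in big-endian (network) order.
   Works the same on any host, whatever its native byte order. */
static unsigned get_u16be(const unsigned char *p)
{
    return ((unsigned)p[0] << 8) | (unsigned)p[1];
}

/* Same thing for little-endian storage, as used on Windows. */
static unsigned get_u16le(const unsigned char *p)
{
    return ((unsigned)p[1] << 8) | (unsigned)p[0];
}
```

The price is a shift and an OR per code unit, which is exactly the kind
of cost that makes people prefer the native order of their platform.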
> Also, UTF-8's compactness makes it great for processing large volumes
> of UTF-8.
Is it a real problem? Have you figures?
I happen to have suffered from this kind of problem some years ago. My box
only had 32 then 64 MB, and I was dealing with multi-Mc texts. And UTF-8
proved to be an important constraining factor when compared with legacy
(here ISCII/CSX) encodings, when it comes to memory size...
> I think that UTF16 is really bad. UTF16 is basically popular, because
> so many people thought UCS2 was the answer to internationalisation.
> UTF16 was kind of a "switch and bait" technique (unintentional of
> course). Had it been known that we need to treat characters as
> multiple units of variables, we might as well have gone for UTF8!
You are perhaps missing something here.
UCS-2 (that is, the difference between Unicode vs. DIS 10646) was viewed
with much expectation back in the '90s, when engineers were tired of having
to deal with multibyte encodings (including stateful ones), which fitted
pretty badly within the scheme of the existing (ASCII- or EBCDIC-based)
software. Of course, the result did not live up to the most optimistic
expectations. Part of this comes from the UTF-16 kludge, part from the
problems related to decompositions, as you mentioned, and also from other
problems, such as the lack of an easy transposition of established
mechanisms like <ctype.h> for example; or the underlying fact that we
should deal with strings rather than characters, as C and the usual 3GL
programming languages invite us to do.
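By "the UTF-16 kludge" I mean surrogate pairs: code points above U+FFFF
take two 16-bit code units, so UTF-16 is variable-sized too. A sketch of
the decoding step, with a function name made up for this example:

```c
/* Combine a high (lead) and low (trail) surrogate into the code point
   they encode. Returns -1 if the two units do not form a valid pair. */
static long utf16_decode_pair(unsigned hi, unsigned lo)
{
    if (hi < 0xD800 || hi > 0xDBFF || lo < 0xDC00 || lo > 0xDFFF)
        return -1;
    return 0x10000L + ((long)(hi - 0xD800) << 10) + (long)(lo - 0xDC00);
}
```

For instance, the pair D834 DD1E decodes to U+1D11E.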
However, UTF-8 on the other hand is a step back to multibyte encodings.
Yes, it is stateless; it is easy to synchronize; and also in the meantime,
software engineers have learned, and much existing code is not
multibyte-hostile anymore. In other words, it is a very good variable-sized
encoding. Which does not prevent it from being a variable-sized encoding.
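The "easy to synchronize" part can be sketched in a few lines:
continuation bytes always have the bit pattern 10xxxxxx, so from any
offset one can step back to the start of the current character without
any state. The function name is invented for this illustration, and the
sketch assumes well-formed UTF-8.

```c
#include <stddef.h>

/* Given any byte offset i into the UTF-8 buffer s, return the offset of
   the first byte of the character that contains byte i.
   Continuation bytes satisfy (byte & 0xC0) == 0x80. */
static size_t utf8_sync_back(const unsigned char *s, size_t i)
{
    while (i > 0 && (s[i] & 0xC0) == 0x80)
        i--;
    return i;
}
```

Still, the loop itself shows the other side of the coin: reaching the
n-th character means walking the bytes, not indexing.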
> The people who like UTF16 because UTF8 takes 3 bytes where UTF16 takes
> 2 for their favourite language... I can see their point. But even
> then, with the prevalence of markup, and the prevalence of 1 byte
> punctuation, the trade-off is really quite small.
Figures?
Also, you do use U+2028 as line separator, as Unicode mandates, don't you?
> UTF-8 (byte) processing code is also more compatible with that Unicode
> compression scheme whose acronym I forget (something like SCSU).
I am not sure that text processing should spend an appreciable part of its
time doing compression-decompression. In fact, if that is the case,
something seems wrong to me.
> Its too bad MicroSoft and Apple didn't realise the same, before they
> made their silly UCS-2 APIs.
You began by considering the perfect world of pure text processing: so any
argument related to file systems or use of string atoms in the API is
deemed irrelevant, and I will refrain from bringing them in.
However, what appears to me basically unacceptable is to bash Microsoft or
Apple for lack of vision when it comes to the APIs they designed 10-15 years
ago, yet to consider ANSI (89, not 95) C libraries on an octet-oriented
machine as the only available alternative in 2004, when looking toward the
future.
Antoine