From: Theodore H. Smith (firstname.lastname@example.org)
Date: Wed Dec 01 2004 - 16:40:06 CST
Assume you had no legacy code, and no "handy" libraries either, except
for the byte libraries in C (string.h, stdlib.h). Just a C++ compiler,
a "blank page" to draw on, and a requirement to do a lot of Unicode
text processing.
Apart from that, the real world would still apply: you may want to use
this code in the future, or from within other environments, but even
those environments wouldn't have legacy code or existing Unicode
libraries.
Obviously, if you had existing libraries that made things "easier" (so
you hope), or legacy code that used UCS-2, then things are different, so
we are ignoring these cases. It's all fresh.
What would be the nicest UTF to use?
I think UTF8 would be the nicest UTF.
Some people say that UTF-32 offers simpler, better code from an
architectural viewpoint. After all, every character is one variable.
But does UTF-32 really offer simpler, better, faster, cleaner code?
A Unicode "character" can be decomposed, meaning that one character
could still be a few UTF-32 code points! You'll still need to carry
around "strings" of code units, instead of single characters.
The fact that it is total bloat isn't so great, either. Bloat-mongers
aren't your friend.
The fact that it is incompatible with existing byte-oriented code
doesn't help either.
UTF-8 can be used with the existing byte libraries just fine. The
"problem" of multiple code points for a decomposed character isn't
relatively worse, because we still have decomposed characters in
UTF-32. An accented A in UTF-8 would be 3 bytes decomposed; in UTF-32,
that's 8 bytes.
Also, UTF-8 is a great file format, or socket-transfer format. Not
needing to convert is great. It's also compatible with C strings.
Also, UTF-8 has no endian issues.
Also, UTF-8's compactness makes it great for processing large volumes
of text.
I think that UTF-16 is really bad. UTF-16 is basically popular because
so many people thought UCS-2 was the answer to internationalisation.
UTF-16 was kind of a "bait and switch" technique (unintentional, of
course). Had it been known that we need to treat characters as multiple
code units, we might as well have gone for UTF-8!
The people who like UTF-16 because UTF-8 takes 3 bytes where UTF-16
takes 2 for their favourite language... I can see their point. But even
then, with the prevalence of markup and the prevalence of 1-byte
punctuation, the trade-off is really quite small. UTF-8 (byte)
processing code is also more compatible with that Unicode compression
scheme whose acronym I forget (something like SCSU).
I think with that compression scheme, the Unicode text would be even
smaller than UTF-16. SCSU (or whatever it's called) can be processed as
markup (XML for example) with no decompression, so it's quite handy.
Anyhow, that's why I think UTF-8 is really the way to go.
It's too bad Microsoft and Apple didn't realise the same before they
made their silly UCS-2 APIs.
-- Theodore H. Smith - Software Developer - www.elfdata.com/plugin/ Industrial strength string processing code, made easy. (If you believe that's an oxymoron, see for yourself.)
This archive was generated by hypermail 2.1.5 : Wed Dec 01 2004 - 16:42:45 CST