Re: UTF-8 can be used for more than it is given credit ( Re: UTF-7 - is it dead? )

From: Kenneth Whistler
Date: Fri Jun 02 2006 - 18:24:59 CDT

  • Next message: Philippe Verdy: "Re: UTF-8 can be used for more than it is given credit ( Re: UTF-7 - is it dead? )"

    Theodore Smith said:

    > My original point was that UTF-8 can be used for more than it is
    > given credit for. You can do lowercasing, uppercasing, normalisation,
    > and just about anything, on UTF-8, without corruption or mistakes,

    I think most programmers would stipulate that. After all, UTF-8
    is an encoding form of Unicode, and is completely interoperable
    with UTF-16 or UTF-32. If you can define an algorithm on
    UTF-32 and are a competent programmer, you ought to be able to
    figure out a way to accomplish the same thing directly on
    UTF-8 strings containing the same data.

    > and do it CPU efficiently and far more space efficiently.

    This is not at all an obvious claim, and it is pretty easy
    to come up with trivial supporting examples or counterexamples,
    depending on the type of data you are dealing with.

    If your data is all from the ASCII subset of Unicode, then
    of course UTF-8 is going to be far more space efficient, and
    you would also be able to optimize down all kinds of operations
    to their ASCII equivalent, to be much faster than operations
    that have to handle all of Unicode.
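    To illustrate the ASCII fast path (a sketch of my own, not code
    from this thread): because every UTF-8 byte >= 0x80 belongs to a
    multi-byte sequence, an operation like ASCII lowercasing can run
    directly over the bytes without ever corrupting non-ASCII
    characters.

    ```python
    def ascii_lower_utf8(data: bytes) -> bytes:
        # Lowercase only the ASCII range; bytes >= 0x80 are lead or
        # continuation bytes of multi-byte sequences and pass through
        # untouched, so non-ASCII characters are never corrupted.
        out = bytearray(data)
        for i, b in enumerate(out):
            if 0x41 <= b <= 0x5A:      # 'A'..'Z'
                out[i] = b + 0x20      # map to 'a'..'z'
        return bytes(out)
    ```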

    If your data is all Chinese, then UTF-8 is guaranteed to
    be *less* efficient, both for space and for CPU, for almost
    any operation, than UTF-16 would be.
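    The space half of that claim is easy to check (my own sketch,
    not from the thread): CJK ideographs in the BMP take 3 bytes
    each in UTF-8 but only 2 in UTF-16, while ASCII letters take
    1 byte in UTF-8 and 2 in UTF-16.

    ```python
    zh = "统一码"   # three Chinese characters (BMP)
    en = "Unicode"  # seven ASCII characters

    # UTF-8: 3 bytes per CJK character; UTF-16: 2 bytes each.
    assert len(zh.encode("utf-8")) == 9
    assert len(zh.encode("utf-16-le")) == 6

    # UTF-8: 1 byte per ASCII character; UTF-16: 2 bytes each.
    assert len(en.encode("utf-8")) == 7
    assert len(en.encode("utf-16-le")) == 14
    ```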

    In general, most semantic operations on Unicode strings
    require table lookups, and while you can construct table
    lookups based directly on UTF-8 byte values, UTF-16 (or
    UTF-32) lend themselves to more compact *and* more efficient
    table lookups than UTF-8 does.
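    The lookup pattern in question might be sketched like this
    (an assumed illustration, not the poster's code): property
    tables are keyed by code point, which UTF-16 and UTF-32 give
    you almost directly, whereas UTF-8 input must first be decoded
    (or driven through a larger byte-level state machine) before
    the table can be indexed.

    ```python
    import unicodedata

    def categories(s: str) -> list:
        # One table lookup per code point. Given UTF-8 bytes you
        # would have to decode each multi-byte sequence to recover
        # the code point before this lookup could happen.
        return [unicodedata.category(ch) for ch in s]
    ```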

    > And the other point is that a character (aka unicode glyph)


    This is a misusage of the term "glyph" here, I believe.

    > is a
    > string. So whatever you do, you'll need to be string processing,
    > treating each character as a variable length unit, so it might as
    > well be a variable 8-bit length unit than 32bit...

    And while it is true that most Unicode text processing, including
    that which is concerned with combining character sequences, is
    inherently *string* processing, rather than code point processing,
    in almost all cases, any process concerned about the semantics
    of the string (and the characters it contains) will need to
    be cognizant of character boundaries. The semantics, which
    you need to access tables for, inhere in the code points, so
    you can't just treat a UTF-8 string as a bag o' bytes for
    processing. (Except for trivial operations like string copying,
    string length for buffer size, and so on.)
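    A small sketch of why boundary-awareness is cheap but still
    necessary (my own example, not from the post): UTF-8
    continuation bytes all match the bit pattern 10xxxxxx, so the
    code point boundaries are exactly the bytes that do *not*
    match it.

    ```python
    def codepoint_starts(data: bytes) -> list:
        # A byte starts a code point iff its top two bits are not
        # 10 (0x80); continuation bytes always match b & 0xC0 == 0x80.
        return [i for i, b in enumerate(data) if b & 0xC0 != 0x80]
    ```

    Any semantic operation must walk these boundaries; only
    byte-oriented operations like copying can ignore them.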

    > Therefore, I win the discussion. Thank you :)

    We await the decision of the judges. ;-)


    This archive was generated by hypermail 2.1.5 : Fri Jun 02 2006 - 18:33:20 CDT