Re: UTF-8 can be used for more than it is given credit ( Re: UTF-7 - is it dead? )

From: Kenneth Whistler (kenw@sybase.com)
Date: Fri Jun 02 2006 - 18:24:59 CDT

    Theodore Smith said:

    > My original point was that UTF-8 can be used for more than it is
    > given credit for. You can do lowercasing, uppercasing, normalisation,
    > and just about anything, on UTF-8, without corruption or mistakes,

    I think most programmers would stipulate that. After all, UTF-8
    is an encoding form of Unicode, and is completely interoperable
    with UTF-16 or UTF-32. If you can define an algorithm on
    UTF-32 and are a competent programmer, you ought to be able to
    figure out a way to accomplish the same thing directly on
    UTF-8 strings containing the same data.

    But...
     
    > and do it CPU efficiently and far more space efficiently.

    This is not at all an obvious claim, and it is pretty easy
    to come up with trivial supporting examples or counterexamples,
    depending on the type of data you are dealing with.

    If your data is all from the ASCII subset of Unicode, then
    of course UTF-8 is going to be far more space efficient, and
    you would also be able to optimize down all kinds of operations
    to their ASCII equivalent, to be much faster than operations
    that have to handle all of Unicode.

    If your data is all Chinese, then UTF-8 is guaranteed to
    be *less* efficient, both for space and for CPU, for almost
    any operation, than UTF-16 would be.
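
    As a rough illustration of the space side of that comparison, here
    is a minimal Python sketch. The two sample strings are my own
    assumptions, chosen only to represent all-ASCII and all-Chinese
    data; it measures size only and says nothing about CPU cost.

```python
# Compare encoded sizes of the same text in UTF-8 and UTF-16.
# The sample strings are illustrative assumptions, not data from the thread.

def encoded_sizes(text: str) -> dict:
    """Return code point count and byte lengths in UTF-8 and UTF-16 (no BOM)."""
    return {
        "code_points": len(text),
        "utf8_bytes": len(text.encode("utf-8")),
        "utf16_bytes": len(text.encode("utf-16-le")),
    }

ascii_sample = "Unicode"      # 7 ASCII characters
chinese_sample = "统一码"      # 3 CJK ideographs, all in the BMP

ascii_sizes = encoded_sizes(ascii_sample)
chinese_sizes = encoded_sizes(chinese_sample)

print(ascii_sizes)    # 7 bytes in UTF-8, 14 in UTF-16
print(chinese_sizes)  # 9 bytes in UTF-8, 6 in UTF-16
```

    ASCII takes one byte per character in UTF-8 against two in UTF-16;
    BMP ideographs take three against two.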

    In general, most semantic operations on Unicode strings
    require table lookups, and while you can construct table
    lookups based directly on UTF-8 byte values, UTF-16 and
    UTF-32 lend themselves to more compact *and* more efficient
    table lookups than UTF-8 does.
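
    To make the table-lookup point concrete, here is a sketch of a
    two-stage table indexed by a 16-bit code unit. The property chosen
    (general category Lu), the 256-entry block size, and all names are
    assumptions for illustration, not anything from a particular
    library. It covers BMP code units only; supplementary characters
    would need surrogate handling that is not shown.

```python
import unicodedata

BLOCK = 256  # code units per second-stage block (an arbitrary choice)

def is_uppercase_letter(ch: str) -> bool:
    # Illustrative property: general category "Lu".
    return unicodedata.category(ch) == "Lu"

def build_two_stage(prop):
    """Build (stage1, blocks) over the BMP, storing each distinct block once."""
    stage1, blocks, seen = [], [], {}
    for start in range(0, 0x10000, BLOCK):
        block = bytes(prop(chr(cp)) for cp in range(start, start + BLOCK))
        if block not in seen:
            seen[block] = len(blocks)
            blocks.append(block)
        stage1.append(seen[block])
    return stage1, blocks

stage1, blocks = build_two_stage(is_uppercase_letter)

def lookup(code_unit: int) -> int:
    """Two index operations: high byte picks the block, low byte the entry."""
    return blocks[stage1[code_unit >> 8]][code_unit & 0xFF]

print(lookup(ord("A")), lookup(ord("a")), lookup(0x0410))
print(len(stage1), "first-stage entries,", len(blocks), "distinct blocks")
```

    With UTF-16 the code unit is the index. Starting from UTF-8 you
    would either decode one to four bytes into a code point first, or
    build a trie keyed on the bytes themselves, which takes more
    stages to reach the same entry.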
     
    > And the other point is that a character (aka unicode glyph)
                                                   ^^^^^^^^^^^^^

    This is a misuse of the term "glyph" here, I believe.
                                                   
    > is a
    > string. So whatever you do, you'll need to be string processing,
    > treating each character as a variable length unit, so it might as
    > well be a variable 8-bit length unit than 32bit...

    And while it is true that most Unicode text processing, including
    that which is concerned with combining character sequences, is
    inherently *string* processing, rather than code point processing,
    in almost all cases, any process concerned about the semantics
    of the string (and the characters it contains) will need to
    be cognizant of character boundaries. The semantics, which
    you need to access tables for, inhere to the code points, so
    you can't just treat a UTF-8 string as a bag o' bytes for
    processing. (Except for trivial operations like string copying,
    string length for buffer size, and so on.)
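
    A small sketch of what being cognizant of character boundaries
    means in practice: the lead byte of each UTF-8 sequence gives its
    length, and the code point has to be reassembled before any table
    can be consulted. The function name and sample string are
    hypothetical, and the decoder assumes well-formed input; it does
    no validation.

```python
def utf8_code_points(data: bytes):
    """Yield code points from well-formed UTF-8; no validation is attempted."""
    i = 0
    while i < len(data):
        lead = data[i]
        if lead < 0x80:          # 0xxxxxxx: one byte
            length, cp = 1, lead
        elif lead < 0xE0:        # 110xxxxx: two bytes
            length, cp = 2, lead & 0x1F
        elif lead < 0xF0:        # 1110xxxx: three bytes
            length, cp = 3, lead & 0x0F
        else:                    # 11110xxx: four bytes
            length, cp = 4, lead & 0x07
        # Each continuation byte (10xxxxxx) contributes six more bits.
        for b in data[i + 1:i + length]:
            cp = (cp << 6) | (b & 0x3F)
        yield cp
        i += length

sample = "aé中😀"                 # 1-, 2-, 3- and 4-byte characters
data = sample.encode("utf-8")
decoded = list(utf8_code_points(data))

print(len(data))                        # byte length, enough for a buffer size
print([f"U+{cp:04X}" for cp in decoded])  # what a property lookup needs
```

    Copying `data` or taking `len(data)` needs none of this, which is
    the exception noted above.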

    >
    > Therefore, I win the discussion. Thank you :)

    We await the decision of the judges. ;-)

    --Ken



    This archive was generated by hypermail 2.1.5 : Fri Jun 02 2006 - 18:33:20 CDT