Re: UTF-8 can be used for more than it is given credit ( Re: UTF-7 - is it dead? )

From: Theodore H. Smith
Date: Sat Jun 03 2006 - 06:56:59 CDT

    >> and do it CPU efficiently and far more space efficiently.
    > This is not at all an obvious claim, and it is pretty easy
    > to come up with trivial support or counterexamples, depending
    > on the type of data you are dealing with.
    > If your data is all from the ASCII subset of Unicode, then
    > of course UTF-8 is going to be far more space efficient, and
    > you would also be able to optimize down all kinds of operations
    > to their ASCII equivalent, to be much faster than operations
    > that have to handle all of Unicode.
    > If your data is all Chinese, then UTF-8 is guaranteed to
    > be *less* efficient, both for space and for CPU, for almost
    > any operation, than UTF-16 would be.

    It could be the case for certain languages that UTF-8 is less
    efficient, yes. However, just because some data is in Chinese does
    not mean that UTF-8 is bigger. What if the Chinese is in a web page?
    Then we have all sorts of ASCII HTML which could outweigh the CJK.
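
    To make the size argument concrete, here is a rough sketch in Python
    (not part of the original post). The helper name `encoded_sizes` and
    the sample markup are made up for illustration; the byte counts
    exclude any byte order mark.

```python
def encoded_sizes(text: str) -> tuple[int, int]:
    """Return (UTF-8 size, UTF-16 size) in bytes, without a BOM."""
    return len(text.encode("utf-8")), len(text.encode("utf-16-le"))


# Bare CJK text: 3 bytes per character in UTF-8, 2 in UTF-16.
bare = "中文"

# The same two characters wrapped in a little ASCII markup.
# The markup costs 1 byte per character in UTF-8 but 2 in UTF-16.
wrapped = '<p class="note">中文</p>'

print(encoded_sizes(bare))     # (6, 4)
print(encoded_sizes(wrapped))  # (26, 44)
```

    Which encoding wins depends on the ratio of ASCII to CJK in the
    actual data, so this only shows that the outcome can go either way.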

    > In general, most semantic operations on Unicode strings
    > require table lookups, and while you can construct table
    > lookups based directly on UTF-8 byte values, UTF-16 (or
    > UTF-32) lend themselves to more compact *and* more efficient
    > table lookups than UTF-8 does.

    I'm not so sure about more compact lookups, what with a huge range of
    codepoints. More CPU efficient, perhaps... I've written a nice
    dictionary algorithm which is perfectly suited to UTF-8 processing,
    but I can imagine it is still slower than UTF-32 processing.

    Basically, my dictionary algorithm is able to find "the longest key"
    at a certain position within a string, that exists within the
    dictionary. People have been doing this with "trie" algorithms for
    decades now. I've just made a very space efficient version of a "trie".

    Let's say my dictionary contained "hen", "henry", and "hemp". If I
    had the string "my name is henry smith", and I was checking for a key
    at byte position 12..., it would return "henry" as the key. It does
    this efficiently, due to its tree structure.
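
    A minimal sketch of that longest-key lookup, in Python. This is an
    ordinary dictionary-of-children trie, not the space-efficient
    structure described in this post, and the names `ByteTrie` and
    `longest_key_at` are invented for the example. Keys and text are
    handled purely as UTF-8 bytes.

```python
class ByteTrie:
    """A plain (not space-optimized) trie keyed on raw bytes."""

    def __init__(self):
        self.children = {}    # byte value (int) -> ByteTrie
        self.is_key = False   # True if a stored key ends at this node

    def add(self, key: bytes) -> None:
        node = self
        for b in key:
            node = node.children.setdefault(b, ByteTrie())
        node.is_key = True

    def longest_key_at(self, data: bytes, pos: int):
        """Return the longest stored key starting at data[pos], or None."""
        node = self
        best_end = None
        for i in range(pos, len(data)):
            node = node.children.get(data[i])
            if node is None:
                break
            if node.is_key:
                best_end = i + 1   # remember the longest hit so far
        return None if best_end is None else data[pos:best_end]


trie = ByteTrie()
for word in ("hen", "henry", "hemp"):
    trie.add(word.encode("utf-8"))

text = "my name is henry smith".encode("utf-8")
start = text.index(b"henry")   # 0-based offset 11, i.e. the 12th byte
print(trie.longest_key_at(text, start))   # b'henry'
```

    The walk passes through "hen" on the way and keeps going, which is
    why "henry" is returned rather than the shorter key.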

    It's this kind of tree structure that is perfect for processing
    UTF-8. My version of the trie structure is just more space efficient
    than most trie structures we are used to.

    And because the UTF-8 encoding of one codepoint can never occur inside
    the encoding of another, my trie-like structure can just contain
    bytes, do everything in byte-mode, and never make a mistake.

    Now, if a malformed UTF-8 sequence were written that could exist
    within another UTF-8 codepoint, my algorithm would fail when
    processing it.

    This is why I thank the designers of UTF-8 every day :)

    >> And the other point is that a character (aka unicode glyph)

    > This is a misusage of the term "glyph" here, I believe.


    >> is a
    >> string. So whatever you do, you'll need to be string processing,
    >> treating each character as a variable length unit, so it might as
    >> well be a variable 8-bit length unit than 32bit...
    > And while it is true that most Unicode text processing, including
    > that which is concerned with combining character sequences, is
    > inherently *string* processing, rather than code point processing,
    > in almost all cases, any process concerned about the semantics
    > of the string (and the characters it contains) will need to
    > be cognizant of character boundaries.

    My code implicitly deals with character boundaries, due to the nice
    properties of UTF-8. Given valid UTF-8 input data, it's impossible to
    get a character boundary wrong with my byte oriented algorithm.
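
    The boundary property can be checked with a small Python sketch. It
    is only an illustration and assumes valid UTF-8 on both sides; the
    helper names `is_char_start` and `find_all` are made up here.
    Continuation bytes always have the bit pattern 10xxxxxx, so a key
    that begins with a lead byte or an ASCII byte can never match
    starting in the middle of a character.

```python
def is_char_start(data: bytes, i: int) -> bool:
    """True if offset i starts a UTF-8 sequence (or is the end of data)."""
    return i == len(data) or (data[i] & 0xC0) != 0x80


def find_all(haystack: bytes, needle: bytes) -> list[int]:
    """Plain byte-wise search; returns every offset where needle occurs."""
    hits = []
    i = haystack.find(needle)
    while i != -1:
        hits.append(i)
        i = haystack.find(needle, i + 1)
    return hits


haystack = "naïve café, 中文 and ünïcödé".encode("utf-8")

for needle in ("é", "ï", "文", "caf"):
    raw = needle.encode("utf-8")
    for hit in find_all(haystack, raw):
        # Both ends of every byte-level match sit on character boundaries.
        assert is_char_start(haystack, hit)
        assert is_char_start(haystack, hit + len(raw))
```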

    > The semantics, which
    > you need to access tables for, inhere to the code points, so
    > you can't just treat a UTF-8 string as a bag o' bytes for
    > processing.

    Not if you did it dictionary-like. Instead of having a codepoint
    table for attributes of a character, you could have a string
    dictionary whose values are the attributes of the character.
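
    Roughly, the idea looks like this in Python. It is a sketch under
    assumptions: the attribute stored is the general category from
    Python's `unicodedata` module, the dictionary `ATTRS` holds only a
    handful of sample characters, and a plain `dict` stands in for the
    trie. The name `attributes` is hypothetical.

```python
import unicodedata

# Hypothetical attribute dictionary: UTF-8 byte string -> general category.
# Only a few sample characters are loaded, to keep the example short.
ATTRS = {ch.encode("utf-8"): unicodedata.category(ch) for ch in "aZ9 é文"}


def attributes(data: bytes):
    """Yield (key, attribute) pairs by matching keys against raw bytes."""
    i = 0
    while i < len(data):
        for n in (1, 2, 3, 4):      # a UTF-8 sequence is 1 to 4 bytes long
            key = data[i:i + n]
            if key in ATTRS:
                yield key, ATTRS[key]
                i += n
                break
        else:
            i += 1                  # byte not covered by any key: skip it


for key, category in attributes("a文 Z".encode("utf-8")):
    print(key, category)
```

    No codepoint is ever decoded; the bytes themselves are the lookup
    key.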

    > (Except for trivial operations like string copying,
    > string length for buffer size, and so on.)

    But I already said I have Unicode-correct uppercasing and
    lowercasing code for UTF-8.
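
    For readers who want to see the general shape of such code, here is
    a minimal Python sketch. It is not the code referred to above. It
    assumes Python's `str.upper` as the source of the case mappings,
    uses a plain `dict` instead of a trie, and ignores locale-specific
    rules (Turkish dotted and dotless i, for instance). The names
    `build_upper_table`, `UPPER` and `upper_utf8` are invented.

```python
def build_upper_table() -> dict[bytes, bytes]:
    """Map the UTF-8 bytes of each cased character to its uppercase bytes."""
    table = {}
    for cp in range(0x110000):
        if 0xD800 <= cp <= 0xDFFF:
            continue                 # surrogates cannot be encoded as UTF-8
        ch = chr(cp)
        up = ch.upper()
        if up != ch:
            table[ch.encode("utf-8")] = up.encode("utf-8")
    return table


UPPER = build_upper_table()


def upper_utf8(data: bytes) -> bytes:
    """Uppercase valid UTF-8 input by byte-string replacement only."""
    out = bytearray()
    i = 0
    while i < len(data):
        for n in (1, 2, 3, 4):       # try each possible sequence length
            repl = UPPER.get(data[i:i + n])
            if repl is not None:
                out += repl
                i += n
                break
        else:
            out.append(data[i])      # no mapping: copy the byte unchanged
            i += 1
    return bytes(out)


print(upper_utf8("héllo straße".encode("utf-8")).decode("utf-8"))
```

    The table is built once from codepoints, but the conversion loop
    itself only compares and copies bytes. Replacements may change the
    length, as with "ß" becoming "SS".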

    I'll tell you what.

    What if I compile my source code and put it on my server host, to do
    uppercasing and lowercasing of UTF-8? And then post the address here.
    I'm no web monkey, more of a desktop developer, but I can probably
    handle an uppercase and lowercase button and a text field :)

    Would that prove to you that you can do uppercasing and lowercasing
    on UTF-8 without worrying about the codepoints?

    It'll mean that I need to get out my Linux box (which I rarely use),
    and compile my source for Linux as well as write up a little stdio
    based app in C++... but it should be worth it as this discussion is
    one I have often, and people just can't believe that I have code that
    can do uppercasing and lowercasing on UTF-8. Maybe if I showed them,
    they'd see that I actually DO have the code I said I do.

    >> Therefore, I win the discussion. Thank you :)
    > We await the decision of the judges. ;-)
