Re: UTF-8 can be used for more than it is given credit ( Re: UTF-7 - is it dead? )

From: Philippe Verdy (
Date: Sat Jun 03 2006 - 10:58:41 CDT

    From: "Theodore H. Smith" <>
    >> In general, most semantic operations on Unicode strings
    >> require table lookups, and while you can construct table
    >> lookups based directly on UTF-8 byte values, UTF-16 (or
    >> UTF-32) lend themselves to more compact *and* more efficient
    >> table lookups than UTF-8 does.
    > I'm not so sure about more compact lookups, what with a huge range of
    > codepoints. More CPU efficient, perhaps... I've written a nice
    > dictionary algorithm which is perfectly suited to UTF-8 processing,
    > but I can imagine it is still slower than UTF-32 processing.

    If you look at how ICU works, you will see that table lookup is not based strictly on codepoints, nor on any standard UTF: the external representation (the UTF) is converted into multiple indices of variable sizes, and the tables are compacted without depending on any UTF.
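    To make the idea of "multiple indices" concrete, here is a minimal sketch of a two-stage lookup table. It is not ICU's actual data structure (ICU's tries are more elaborate); the block size, the helper names build_two_stage and lookup, and the choice of "is uppercase letter" as the property are all assumptions made for illustration. The point is only that the table is indexed by pieces of the codepoint and that identical blocks are shared, independently of any UTF.

```python
import unicodedata

# Assumed parameters for this sketch: 256 codepoints per block.
BLOCK_BITS = 8
BLOCK_SIZE = 1 << BLOCK_BITS
MAX_CODEPOINT = 0x110000  # one past U+10FFFF


def is_upper(cp):
    """Example property: 1 if the codepoint is an uppercase letter (Lu), else 0."""
    return 1 if unicodedata.category(chr(cp)) == "Lu" else 0


def build_two_stage(prop):
    """Build (index, blocks): index maps a block number to a shared block."""
    index = []
    blocks = []
    seen = {}  # block contents -> position in `blocks`
    for start in range(0, MAX_CODEPOINT, BLOCK_SIZE):
        block = tuple(prop(cp) for cp in range(start, start + BLOCK_SIZE))
        if block not in seen:
            seen[block] = len(blocks)
            blocks.append(block)
        index.append(seen[block])
    return index, blocks


def lookup(index, blocks, cp):
    """Split the codepoint into a high part (stage 1) and a low part (stage 2)."""
    return blocks[index[cp >> BLOCK_BITS]][cp & (BLOCK_SIZE - 1)]


INDEX, BLOCKS = build_two_stage(is_upper)

if __name__ == "__main__":
    print("stage-1 entries:", len(INDEX))
    print("distinct blocks:", len(BLOCKS))
    print("U+0041 uppercase?", lookup(INDEX, BLOCKS, 0x41))
```

    Because most blocks of the codespace are identical for a given property, the number of distinct blocks stays far below the number of stage-1 entries, which is where the compaction comes from.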

    The external representation has absolutely no impact on how text should be processed internally. All UTFs are made for data interchange and storage; they are NOT designed to be the best solution for processing.

    So think twice about which cost matters most: processing time or data retrieval time? The second almost always drives the solution, and data compaction on external storage, or when keeping lots of text in memory (for example in a word processor or editor handling large documents), gives you the hint about which representation should be used. (Note that for in-memory-only storage, the UTFs are not required and not even recommended as the best solution.)

    Support for UTFs tends to become mandatory only in transmissions between systems, i.e. in communication protocols designed to work in heterogeneous environments. For all the rest, you have the choice:

    Unicode text handling algorithms are NOT designed in terms of UTFs (not even UTF-32!) but in terms of (virtual) codepoints (and their associated abstract characters).

    These algorithms are independent of the encoding or UTF used (if a UTF is used at all), provided that codepoint identity is preserved by the chosen encoding. This is guaranteed by the standard UTFs, by SCSU, BOCU-1 and the representations described in the Unicode standard, and by Sun's modified UTF-8 for its Java/JNI interface and its compiled class serialization format, but it is not always guaranteed by all character encodings, notably legacy ones, including GB18030.
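    A small sketch of what "codepoint identity is preserved" means in practice: the same text stored in several UTFs decodes to the same sequence of codepoints, so an algorithm written against codepoints gives the same answer whatever the storage form. The sample string and the helper names codepoints and count_non_ascii are assumptions chosen for illustration; only Python's standard codecs are used.

```python
def codepoints(data, encoding):
    """Decode bytes in the given encoding and return the codepoint sequence."""
    return [ord(ch) for ch in data.decode(encoding)]


def count_non_ascii(cps):
    """A toy 'algorithm' defined purely on codepoints, not on any UTF."""
    return sum(1 for cp in cps if cp > 0x7F)


# Sample text: U+0041, U+00E9, U+4E2D, U+1D11E (one supplementary character).
TEXT = "A\u00e9\u4e2d\U0001d11e"
ENCODINGS = ["utf-8", "utf-16-le", "utf-16-be", "utf-32-le", "utf-32-be"]

# Encode once per UTF, then decode back to codepoints.
STORED = {enc: TEXT.encode(enc) for enc in ENCODINGS}
SEQUENCES = {enc: codepoints(data, enc) for enc, data in STORED.items()}

if __name__ == "__main__":
    for enc in ENCODINGS:
        hex_cps = ["U+%04X" % cp for cp in SEQUENCES[enc]]
        print(enc, len(STORED[enc]), "bytes", hex_cps,
              "non-ASCII:", count_non_ascii(SEQUENCES[enc]))
```

    The byte counts differ from one UTF to another, which is the storage and interchange concern; the codepoint sequence, and therefore the result of the algorithm, does not.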
