Re: UTF-8 can be used for more than it is given credit ( Re: UTF-7 - is it dead? )

From: Theodore H. Smith
Date: Sun Jun 04 2006 - 03:59:27 CDT

    On 4 Jun 2006, at 02:53, Asmus Freytag wrote:

    > All nice advantages. On the other hand, the minute you do text
    > processing on the actual text data, such as morphological analysis,

    What's that? Something like Levenshtein (edit distance)? If you are
    talking about a Levenshtein-like algorithm on Unicode, you can't do it
    with codepoint processing, because a character is not a codepoint; a
    character is a string of codepoints. So if your "cells" must now be
    strings instead of bytes or UInt32s, you might as well use a string
    of UTF-8 instead of a string of UTF-32.
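
    A minimal sketch of that point, in Python. The clusters() helper is
    a hypothetical simplification (a base codepoint plus any combining
    marks that follow it), not the full grapheme cluster rules, and
    edit_distance() is a plain textbook Levenshtein that accepts any
    sequence of cells:

```python
import unicodedata

def clusters(text):
    """Split text into rough 'characters': a base codepoint plus the
    combining marks that follow it. Simplified; not full UAX #29."""
    out = []
    for cp in text:
        if out and unicodedata.combining(cp) != 0:
            out[-1] += cp          # attach the mark to the previous cell
        else:
            out.append(cp)         # start a new cell
    return out

def edit_distance(a, b):
    """Levenshtein distance over two sequences of comparable cells."""
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, 1):
        cur = [i]
        for j, y in enumerate(b, 1):
            cur.append(min(prev[j] + 1,              # deletion
                           cur[j - 1] + 1,           # insertion
                           prev[j - 1] + (x != y)))  # substitution
        prev = cur
    return prev[-1]

decomposed = "e\u0301"  # 'e' followed by COMBINING ACUTE ACCENT

# Cells are codepoints: the accent counts as a separate edit.
print(edit_distance(decomposed, "a"))

# Cells are strings of codepoints: one character replaced by another.
print(edit_distance(clusters(decomposed), clusters("a")))

# The same cells as UTF-8 byte strings give the same answer.
utf8_cells = [c.encode("utf-8") for c in clusters(decomposed)]
print(edit_distance(utf8_cells, [b"a"]))
```

    Once the cells are variable-length strings anyway, whether each one
    is stored as UTF-8 or UTF-32 makes no difference to the algorithm.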

    > case transformation,

    I have case transformation code running directly on UTF-8.
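
    To give an idea of what that can look like (this is only an
    illustrative sketch, not my actual code): the table UPPER and the
    names seq_len() and upper_utf8() are made up for the example, the
    table holds just three entries, and the input is assumed to be
    well-formed UTF-8.

```python
# Hypothetical, deliberately tiny mapping table keyed by UTF-8 sequences.
# A real table would be generated from the Unicode case mapping data.
UPPER = {
    b"\xc3\xa9": b"\xc3\x89",  # U+00E9 -> U+00C9 (same length)
    b"\xc3\x9f": b"SS",        # U+00DF -> "SS"   (one codepoint to two)
    b"\xef\xac\x81": b"FI",    # U+FB01 -> "FI"   (three bytes to two)
}

def seq_len(lead):
    """Length of a UTF-8 sequence, judged from its lead byte."""
    if lead < 0x80:
        return 1
    if lead >> 5 == 0b110:
        return 2
    if lead >> 4 == 0b1110:
        return 3
    if lead >> 3 == 0b11110:
        return 4
    raise ValueError("not a UTF-8 lead byte: 0x%02x" % lead)

def upper_utf8(data):
    """Uppercase a UTF-8 byte string without decoding to codepoints."""
    out = bytearray()
    i = 0
    while i < len(data):
        n = seq_len(data[i])
        unit = data[i:i + n]
        if n == 1 and 0x61 <= unit[0] <= 0x7A:
            out.append(unit[0] - 0x20)      # ASCII a-z
        else:
            out += UPPER.get(unit, unit)    # table hit, or pass through
        i += n
    return bytes(out)

print(upper_utf8("straße café".encode("utf-8")).decode("utf-8"))
```

    The output can be longer or shorter than the input, which is why
    the unit being mapped is a byte string and not a fixed-width integer.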

    Once I figure out, in theory, how to do normalisation on UTF-32, or
    at all, I'll be able to do it on UTF-8 as well.

    > linguistically aware search, etc.

    I bet I could use my existing code to do this also :) Besides, what
    if your code must deal with things like ß == SS in the search? You'll
    need a string processing library, not a codepoint processing library,
    because the equivalence is done on a variable length unit! (I just so
    happen to have such "variable length unit" processing code on UTF-8 :) )
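
    A rough sketch of the ß == SS case, in Python. It leans on the
    built-in str.casefold(), which folds ß to "ss"; the helper names
    fold_with_offsets() and find_folded() are invented for the example,
    and normalisation is ignored entirely.

```python
def fold_with_offsets(text):
    """Case-fold text, remembering which original index produced each
    folded codepoint (one input codepoint may fold to several)."""
    folded = []
    origin = []
    for i, ch in enumerate(text):
        for f in ch.casefold():
            folded.append(f)
            origin.append(i)
    return "".join(folded), origin

def find_folded(haystack, needle):
    """Return the (start, end) slice of haystack that matches needle
    under case folding, or None if there is no match."""
    folded, origin = fold_with_offsets(haystack)
    target = needle.casefold()
    if not target:
        return None
    pos = folded.find(target)
    while pos != -1:
        end = pos + len(target)
        # Reject matches that would split one original codepoint in two,
        # such as matching only the first "s" of a folded ß.
        starts_clean = pos == 0 or origin[pos - 1] != origin[pos]
        ends_clean = end == len(folded) or origin[end - 1] != origin[end]
        if starts_clean and ends_clean:
            return origin[pos], origin[end - 1] + 1
        pos = folded.find(target, pos + 1)
    return None

text = "Die Straße"
span = find_folded(text, "STRASSE")
print(span, text[span[0]:span[1]])
```

    The match covers six codepoints in the text for a seven-codepoint
    query, so the comparison cannot be done one codepoint at a time.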

    > you will need to perform an implicit conversion to integral
    > character values in order to get at their properties, which you
    > will need to drive your algorithm.

    But you need string processing to do case transformation, because in
    Unicode, characters are sequences of codepoints.

    I've actually made these same points, in response to the same
    objections, a few times now in this thread. No one has found any flaw
    in them, and no one has yet responded to my offer to put up an example
    on my web host that does Unicode case transformations.
