Re: UTF-8 can be used for more than it is given credit ( Re: UTF-7 - is it dead? )

From: Theodore H. Smith (delete@elfdata.com)
Date: Sun Jun 04 2006 - 03:59:27 CDT

Next message: Richard Wordingham: "Re: UTF-8 can be used for more than it is given credit"

Previous message: Donald Z. Osborn: "Re: are Unicode codes somehow specified in official national linguistic literature ? (worldwide)"
In reply to: Asmus Freytag: "Re: UTF-8 can be used for more than it is given credit ( Re: UTF-7 - is it dead? )"
Next in thread: Doug Ewell: "Re: UTF-8 can be used for more than it is given credit ( Re: UTF-7 - is it dead? )"
Reply: Doug Ewell: "Re: UTF-8 can be used for more than it is given credit ( Re: UTF-7 - is it dead? )"

On 4 Jun 2006, at 02:53, Asmus Freytag wrote:

> All nice advantages. On the other hand, the minute you do text
> processing on the actual text data, such as morphological analysis,

What's that? Something like Levenshtein (edit distance)? If you mean a
Levenshtein-like algorithm on Unicode, you can't do it with codepoint
processing, because a character is not a codepoint; a character is a
string of codepoints. So if your "cells" must now be strings instead of
bytes or UInt32s, you might as well use a string of UTF-8 instead of a
string of UTF-32.
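
To make the "cells are strings" point concrete, here is a minimal Python sketch of edit distance over character clusters rather than codepoints. The names `clusters`, `edit_distance` and `cluster_distance` are made up for illustration, and the clustering rule (a base codepoint plus any following combining marks) is an assumption that only approximates real text segmentation.

```python
import unicodedata

def clusters(text):
    """Split text into approximate user-perceived characters.

    Assumption: a cluster is one base codepoint plus any combining
    marks that follow it. Full segmentation has more rules than this.
    """
    out = []
    for ch in text:
        if out and unicodedata.combining(ch):
            out[-1] += ch          # attach the mark to the previous cell
        else:
            out.append(ch)         # start a new cell
    return out

def edit_distance(a, b):
    """Levenshtein distance over two sequences of comparable cells."""
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, 1):
        cur = [i]
        for j, y in enumerate(b, 1):
            cur.append(min(prev[j] + 1,              # delete x
                           cur[j - 1] + 1,           # insert y
                           prev[j - 1] + (x != y)))  # substitute or match
        prev = cur
    return prev[-1]

def cluster_distance(s, t, encoding=None):
    """Edit distance where each cell is a whole cluster.

    If an encoding is given, each cell is stored as bytes in that
    encoding; the distance does not depend on the choice.
    """
    a, b = clusters(s), clusters(t)
    if encoding:
        a = [c.encode(encoding) for c in a]
        b = [c.encode(encoding) for c in b]
    return edit_distance(a, b)

decomposed = "e\u0301"   # 'e' followed by COMBINING ACUTE ACCENT
print(edit_distance(decomposed, "a"))               # codepoint cells
print(cluster_distance(decomposed, "a"))            # cluster cells
print(cluster_distance(decomposed, "a", "utf-8"))   # cluster cells as UTF-8
```

With codepoint cells, the decomposed "é" counts as two edits away from "a"; with cluster cells it is one, whether the cells are held as UTF-8 bytes or as anything else.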

> case transformation,

I have case transformation code that runs directly on UTF-8.
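
As a rough illustration of what "directly on UTF-8" can mean (this is not the code referred to above), the Python sketch below uppercases a byte string by looking up whole UTF-8 sequences in a table, without ever building an array of codepoints. The names `utf8_len`, `build_upper_table` and `upper_utf8` are hypothetical, the table is derived from Python's own `str.upper` as a stand-in for the Unicode case data, it only covers U+0000..U+04FF, and the input is assumed to be valid UTF-8.

```python
def utf8_len(lead):
    """Length of a UTF-8 sequence, read from its lead byte.

    Assumes the input is valid UTF-8; no error checking is done.
    """
    if lead < 0x80:
        return 1
    if lead < 0xE0:
        return 2
    if lead < 0xF0:
        return 3
    return 4

def build_upper_table(chars):
    """Map each character's UTF-8 bytes to the UTF-8 bytes of its uppercase.

    Assumption: Python's str.upper stands in for the real case data.
    Only characters that actually change are stored.
    """
    return {c.encode("utf-8"): c.upper().encode("utf-8")
            for c in chars if c.upper() != c}

# Illustrative coverage only: Latin, Greek and Cyrillic blocks.
TABLE = build_upper_table(map(chr, range(0x0500)))

def upper_utf8(data, table=TABLE):
    """Uppercase a UTF-8 byte string one byte sequence at a time."""
    out = bytearray()
    i = 0
    while i < len(data):
        n = utf8_len(data[i])
        unit = data[i:i + n]
        out += table.get(unit, unit)   # the replacement may differ in length
        i += n
    return bytes(out)

print(upper_utf8("straße".encode("utf-8")).decode("utf-8"))
```

Both the keys and the values of the table are variable-length byte strings, so a mapping such as ß to SS needs no special handling here.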

Once I figure out, in theory, how to do normalisation on UTF-32 (or at
all), I'll be able to do it on UTF-8 as well.

> linguistically aware search, etc.

I bet I could use my existing code to do this also :) Besides, what
if your code must deal with things like ß == SS in a search? You'll
need a string processing library, not a codepoint processing library,
because the equivalence is defined on a variable-length unit! (I
happen to have such "variable-length unit" processing code for UTF-8 :) )
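
A small Python sketch of the ß == SS case, assuming that Python's built-in `str.casefold` is an acceptable stand-in for Unicode case folding. The function name `fold_find` is made up for illustration. It searches on the folded text but reports the match in terms of the original string, which is where the variable-length bookkeeping shows up.

```python
def fold_find(haystack, needle):
    """Find needle in haystack, ignoring case via case folding.

    Returns (start, end) as indices into the ORIGINAL haystack,
    or None if there is no match (or the needle is empty).
    """
    folded = []   # folded codepoints of the haystack
    origin = []   # origin[k] = index in haystack that produced folded[k]
    for i, ch in enumerate(haystack):
        for f in ch.casefold():      # one character may fold to several
            folded.append(f)
            origin.append(i)
    target = needle.casefold()
    if not target:
        return None
    k = "".join(folded).find(target)
    if k < 0:
        return None
    # Map the folded match back onto the original string.
    return origin[k], origin[k + len(target) - 1] + 1

print(fold_find("Die Straße", "STRASSE"))
print(fold_find("Die STRASSE", "straße"))
```

The two matches have different lengths in the original text (six characters versus seven), even though they are equivalent after folding. Note that this sketch will also report a match that ends halfway through a folded ß; a real implementation would have to decide whether to allow that.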

> you will need to perform an implicit conversion to integral
> character values in order to get at their properties, which you
> will need to drive your algorithm.

But you need string processing to do case transformation, because in
Unicode, characters are sequences of codepoints.

I've actually made the same points, in response to the same objections,
a few times now in this thread. No one has found any flaw in them, and
no one has yet responded to my offer to put up an example on my web
host that does Unicode case transformations.

--
http://elfdata.com/plugin/

This archive was generated by hypermail 2.1.5 : Sun Jun 04 2006 - 04:15:50 CDT