UTF-8 can be used for more than it is given credit ( Re: UTF-7 - is it dead? )

From: Theodore H. Smith (delete@elfdata.com)
Date: Fri Jun 02 2006 - 07:04:50 CDT


    I do all my processing on UTF-8.

    I can do Unicode-correct uppercasing and lowercasing on UTF-8
    strings, with no intermediate UTF-16 or UTF-32 stage. I don't even
    process the code points or keep a record of where each character
    starts; the code is byte-aware only.

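    I did this using a parallel string replacement algorithm. The nice
    thing about UTF-8 is that, assuming the UTF-8 input text is good and
    the UTF-8 strings I am replacing from and to are good, this cannot
    corrupt the UTF-8 or cause false matches or false misses. That works
    because in UTF-8 a lead byte can never be mistaken for a continuation
    byte, so a well-formed sequence only ever matches at a character
    boundary.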
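
    A minimal Python sketch of byte-only multiple string replacement,
    used here for uppercasing. This is not the author's code: the names
    build_case_table and replace_bytes are hypothetical, the table is
    built with Python's own str.upper() purely for convenience, and a
    plain dictionary with a longest-key-first scan stands in for
    whatever parallel algorithm the post refers to. The scanning loop
    itself never decodes a code point.

```python
def build_case_table(chars):
    """Map the UTF-8 bytes of each character to the UTF-8 bytes of its uppercase form.

    Assumption: str.upper() is used only to build the table; any other
    source of case mappings would do.
    """
    table = {}
    for ch in chars:
        upper = ch.upper()
        if upper != ch:
            table[ch.encode("utf-8")] = upper.encode("utf-8")
    return table


def replace_bytes(data, table):
    """Replace every key of `table` found in `data` (a bytes object), byte by byte.

    At each position the longest candidate key is tried first. Bytes that
    start no key are copied through unchanged, one at a time.
    """
    max_len = max(map(len, table), default=0)
    out = bytearray()
    i = 0
    while i < len(data):
        for n in range(min(max_len, len(data) - i), 0, -1):
            repl = table.get(data[i:i + n])
            if repl is not None:
                out += repl
                i += n
                break
        else:
            # No key matches here: copy a single byte and move on.
            out.append(data[i])
            i += 1
    return bytes(out)


table = build_case_table("abcdefghijklmnopqrstuvwxyzéß")
result = replace_bytes("café straße".encode("utf-8"), table)
print(result.decode("utf-8"))
```

    Note that the loop advances one byte at a time when nothing matches,
    so it does visit continuation bytes. Because every key starts with an
    ASCII byte or a lead byte, no key can match at such a position, which
    is the property the paragraph above relies on.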

    Once I figure out how to do Unicode normalisation at all (in UTF-32
    or any other code unit size), I'll extend my code to do normalisation
    directly on UTF-8, once again with no code point processing, only
    byte-aware code. Currently I'm still a little confused about
    normalisation.

    I haven't yet come across anything that needs UTF-32 instead of
    UTF-8 to stay fast. I don't deny that in some situations UTF-32
    will be quicker, but how likely are those situations, keeping in
    mind that you usually need to convert to and from UTF-8 anyway?

    The thing is, seeing as a user-perceived character (what I'd loosely
    call a Unicode glyph) can be a string of several code points, you'll
    need to do string processing and not code point processing anyhow,
    just to process those characters meaningfully. So whether it's a
    string of unsigned longs or a string of unsigned bytes, what
    difference does it make?
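
    A small Python illustration of that point, using only the standard
    unicodedata module. The helper name describe is hypothetical. It
    shows one visible character, é, written either as one code point or
    as two, so that even in UTF-32 it is not always a single fixed-size
    unit.

```python
import unicodedata


def describe(text):
    """Return the code point count and encoded sizes for one piece of text."""
    return {
        "code_points": len(text),
        "utf8_bytes": len(text.encode("utf-8")),
        "utf32_bytes": len(text.encode("utf-32-le")),
    }


composed = "\u00e9"      # é as a single precomposed code point
decomposed = "e\u0301"   # e followed by COMBINING ACUTE ACCENT

print(describe(composed))
print(describe(decomposed))

# Both spellings are canonically equivalent: NFC maps one onto the other.
print(unicodedata.normalize("NFC", decomposed) == composed)
```

    In the decomposed form the character occupies two 32-bit units in
    UTF-32 and three bytes in UTF-8, so code that wants to treat it as
    one unit has to handle a variable-length sequence in either encoding.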

    I'll grant you that fast code to do multiple string replacement
    efficiently, and to do all sorts of lovely tree-based,
    dictionary-like lookups on text, is hard to come by. That might
    explain why people can't imagine processing UTF-8 natively: perhaps
    they just don't have the tools for it.
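
    One common shape for such a tree-based lookup is a trie keyed on
    individual bytes. The sketch below is an assumption about what that
    could look like in Python, not a description of the author's tools;
    the class name ByteTrie and its methods are hypothetical.

```python
class ByteTrie:
    """A trie whose edges are single byte values (ints 0..255)."""

    def __init__(self):
        self.children = {}
        self.value = None

    def insert(self, key, value):
        """Store `value` under the bytes object `key`."""
        node = self
        for b in key:
            node = node.children.setdefault(b, ByteTrie())
        node.value = value

    def longest_match(self, data, start=0):
        """Return (length, value) for the longest key prefixing data[start:].

        Returns (0, None) when no key matches at that position.
        """
        node = self
        best = (0, None)
        for offset in range(start, len(data)):
            node = node.children.get(data[offset])
            if node is None:
                break
            if node.value is not None:
                best = (offset - start + 1, node.value)
        return best


trie = ByteTrie()
trie.insert(b"s", "S")
trie.insert(b"st", "ST")
trie.insert("ß".encode("utf-8"), "SS")

data = "straße".encode("utf-8")
print(trie.longest_match(data, 0))  # position of "st"
print(trie.longest_match(data, 4))  # position of the lead byte of "ß"
print(trie.longest_match(data, 5))  # continuation byte of "ß"
```

    A lookup walks one byte per step and never needs to know where
    characters begin. Starting a lookup on a continuation byte simply
    finds nothing, provided every key is itself well-formed UTF-8.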

    I don't think the argument that we can afford to waste 4x the RAM and
    disk space is a good one. Why buy 4 hard disks when one will do?

    Think of all the ecological burden you put on our planet by rejecting
    UTF-8 :) That RAM has to come from somewhere, you know.
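
    The 4x figure can be checked directly. The sketch below uses
    Python's built-in codecs; the function name encoded_sizes is
    hypothetical. The "-le" codec variants are chosen so that no byte
    order mark is counted.

```python
def encoded_sizes(text):
    """Return the encoded length in bytes of `text` in three encodings."""
    return {
        name: len(text.encode(name))
        for name in ("utf-8", "utf-16-le", "utf-32-le")
    }


for sample in ("hello", "café", "日本語"):
    print(sample, encoded_sizes(sample))
```

    The full factor of 4 applies to pure ASCII text. For text outside
    ASCII the gap narrows: the three CJK characters above take 9 bytes
    in UTF-8 and 12 in UTF-32.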

    > On Fri, Jun 02, 2006 at 09:23:30AM +0200,
    > Kornkreismuster@web.de <Kornkreismuster@web.de> wrote
    > a message of 15 lines which said:
    >> UTF-32 is for sure a waste of space.
    > This is a very strange argument. We used 8-bit encodings like ASCII for
    > many, many years. Switching to UTF-32 would simply mean multiplying
    > the size of texts by 4 at the maximum, while, at the same time,
    > hard disks have become thousands of times larger!
    > Except for very specific applications (like SMS, currently discussed
    > on this list), there is no reason to reject UTF-32 for size issues. If
    > someone lacks room on their hard disks, they should delete pictures and
    > films first :-)
    > And there are good reasons, IMHO, to use UTF-32, such as the fact that
    > all the characters have the same size.
    > (For the record, in some programming languages like Python, you can
    > use UTF-32 internally; for Python, it is a compilation option, and
    > the default is UTF-16.)
