Nicest UTF

From: Theodore H. Smith (delete@elfdata.com)
Date: Wed Dec 01 2004 - 16:40:06 CST

    Assume you had no legacy code, and no "handy" libraries either,
    except for the byte libraries in C (string.h, stdlib.h). Just a C++
    compiler, a "blank page" to draw on, and a requirement to do a lot
    of Unicode text processing.

    Apart from that, the real world still applies: you may want to use
    this code in the future, or call it from other environments, but
    even those environments wouldn't have legacy code or existing
    Unicode libraries.

    Obviously, if you had existing libraries that made things "easier"
    (so you hope), or legacy code that used UCS-2, then things would be
    different, so we are ignoring those cases. It's all fresh.

    What would be the nicest UTF to use?

    I think UTF-8 would be the nicest UTF.

    Some people say that UTF-32 offers simpler, better code from an
    architectural viewpoint. After all, every character is one variable
    in RAM.

    But does UTF-32 really offer simpler, better, faster, cleaner code?

    A Unicode "character" can be decomposed. Meaning that a character could
    still be a few variables of UTF32 code points! You'll still need to
    carry around "strings" of characters, instead of characters.

    The fact that UTF-32 is totally bloat-worthy isn't so great.
    Bloat-mongers aren't your friend.

    The fact that UTF-32 is incompatible with existing byte-oriented
    code doesn't help either. UTF-8 can be used with the existing byte
    libraries just fine. And the "problem" of needing multiple code
    points for a decomposed character isn't relatively worse in UTF-8,
    because decomposed characters still exist in UTF-32.

    An accented A in UTF-8 would be 3 bytes decomposed. In UTF-32,
    that's 8 bytes!
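
    To check the arithmetic (again my own sketch; the encodings are
    standard): U+0041 is 1 byte in UTF-8 and U+0301 is 2 bytes, so the
    decomposed pair is 3 bytes, while UTF-32 spends 4 bytes on each of
    the two code points:

        #include <stdio.h>
        #include <string.h>
        #include <stdint.h>

        int main(void)
        {
            const char     utf8[]  = "\x41\xCC\x81";     /* A + combining acute */
            const uint32_t utf32[] = { 0x0041, 0x0301 }; /* same two code points */

            printf("UTF-8:  %u bytes\n", (unsigned)strlen(utf8));  /* 3 */
            printf("UTF-32: %u bytes\n", (unsigned)sizeof utf32);  /* 8 */
            return 0;
        }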

    Also, UTF-8 is a great file format and socket-transfer format; not
    needing to convert is great. It's also compatible with C strings.

    Also, UTF-8 has no endian issues.

    Also, UTF-8's compactness makes it great for processing large
    volumes of text.
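
    A quick illustration of that C-string compatibility (my own
    sketch): no byte inside a multi-byte UTF-8 sequence ever falls in
    the ASCII range, so string.h routines such as strchr and strstr
    work unchanged on UTF-8 data when the thing being searched for is
    ASCII markup:

        #include <stdio.h>
        #include <string.h>

        int main(void)
        {
            /* "café" with its 2-byte UTF-8 e-acute (0xC3 0xA9) inside ASCII markup */
            const char *xml   = "<name>caf\xC3\xA9</name>";

            const char *start = strchr(xml, '>') + 1;      /* skip past "<name>" */
            const char *end   = strstr(start, "</name>");  /* find the close tag */

            printf("%.*s\n", (int)(end - start), start);   /* café, on a UTF-8 terminal */
            return 0;
        }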

    I think that UTF-16 is really bad. UTF-16 is basically popular
    because so many people thought UCS-2 was the answer to
    internationalisation. UTF-16 was a kind of "bait and switch"
    (unintentional, of course). Had it been known that we'd need to
    treat characters as multiple code units anyway, we might as well
    have gone for UTF-8!
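
    For anyone who hasn't hit this yet (my own sketch; the formula is
    the standard UTF-16 one): any code point above U+FFFF takes two
    16-bit units in UTF-16, a high surrogate and a low surrogate, so
    UTF-16 code has to handle multi-unit characters anyway:

        #include <stdio.h>
        #include <stdint.h>

        int main(void)
        {
            uint32_t cp   = 0x1D11E;                  /* MUSICAL SYMBOL G CLEF */
            uint32_t v    = cp - 0x10000;
            uint16_t high = (uint16_t)(0xD800 + (v >> 10));   /* high surrogate */
            uint16_t low  = (uint16_t)(0xDC00 + (v & 0x3FF)); /* low surrogate  */

            printf("U+%X -> 0x%04X 0x%04X\n",
                   (unsigned)cp, (unsigned)high, (unsigned)low); /* D834 DD1E */
            return 0;
        }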

    The people who like UTF-16 because UTF-8 takes 3 bytes where UTF-16
    takes 2 for their favourite language... I can see their point. But
    even then, with the prevalence of markup and of 1-byte punctuation,
    the trade-off is really quite small (see the rough sketch below).
    UTF-8 (byte) processing code is also more compatible with that
    Unicode compression scheme whose acronym I forget (something like
    SCSU).

    I think that with that compression scheme, the Unicode text would
    be even smaller than UTF-16. SCSU (or whatever it's called) can be
    processed as markup (XML, for example) with no decompression, so
    it's quite handy.
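
    To put a rough number on that markup trade-off (a back-of-the-
    envelope sketch of my own; the model only holds for scripts like
    CJK that are 3 bytes in UTF-8 and one 16-bit unit in UTF-16):

        #include <stdio.h>
        #include <string.h>

        int main(void)
        {
            /* "<p>你好世界</p>": four CJK characters wrapped in ASCII markup */
            const char *utf8 =
                "<p>\xE4\xBD\xA0\xE5\xA5\xBD\xE4\xB8\x96\xE7\x95\x8C</p>";

            size_t utf8_bytes = strlen(utf8);
            size_t ascii = 0;
            for (const char *p = utf8; *p; p++)
                if ((unsigned char)*p < 0x80)
                    ascii++;

            /* each 3-byte UTF-8 sequence here is one 2-byte unit in UTF-16 */
            size_t utf16_bytes = 2 * (ascii + (utf8_bytes - ascii) / 3);

            printf("UTF-8: %u bytes, UTF-16: %u bytes\n",
                   (unsigned)utf8_bytes, (unsigned)utf16_bytes); /* 19 vs 22 */
            return 0;
        }

    Even with only four characters of CJK text, the ASCII markup keeps
    the UTF-8 version smaller; the balance only tips towards UTF-16
    when long runs of such text carry very little markup.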

    Anyhow, that's why I think UTF-8 is really the way to go.

    It's too bad Microsoft and Apple didn't realise the same before
    they made their silly UCS-2 APIs.

    --
        Theodore H. Smith - Software Developer - www.elfdata.com/plugin/
        Industrial strength string processing code, made easy.
        (If you believe that's an oxymoron, see for yourself.)
    

