Re: Nicest UTF

From: D. Starner
Date: Mon Dec 06 2004 - 20:03:52 CST

    (Sorry for sending this twice, Marcin.)

    "Marcin 'Qrczak' Kowalczyk" writes:
    > UTF-8 is poorly suitable for internal processing of strings in a
    > modern programming language (i.e. one which doesn't already have a
    > pile of legacy functions working on bytes, but which can be designed
    > to make Unicode convenient at all). It's because code points have
    > variable lengths in bytes, so extracting individual characters is
    > almost meaningless (unless you care only about the ASCII subset, and
    > sequences of all other characters are treated as non-interpreted bags
    > of bytes). You can't even have a correct equivalent of C isspace().

    That's assuming that the programming language is similar to C or Ada.
    If you're talking about a language that hides the structure of strings
    and has no problem with variable-length data, then it wouldn't matter
    what the internal representation of the string looks like. You'd need to
    use iterators and discourage the use of arbitrary indexing, but arbitrary
    indexing is rarely important.
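
    As a rough sketch of what I mean (in Python, purely for illustration):
    Utf8String and count_spaces are hypothetical names, not any real
    library's API. The type stores UTF-8 bytes internally but only lets
    callers iterate over whole code points, so an isspace()-style test
    never sees a partial character. It assumes the stored bytes are
    well-formed UTF-8.

```python
class Utf8String:
    """Hypothetical string type: UTF-8 inside, iteration-only outside."""

    def __init__(self, text: str) -> None:
        # Internal representation; callers never index into this directly.
        self._data = text.encode("utf-8")

    def __iter__(self):
        data = self._data
        i = 0
        while i < len(data):
            lead = data[i]
            # The lead byte alone determines how many bytes the code point uses
            # (assumes well-formed UTF-8, which encode() guarantees here).
            if lead < 0x80:
                width = 1
            elif lead < 0xE0:
                width = 2
            elif lead < 0xF0:
                width = 3
            else:
                width = 4
            yield data[i:i + width].decode("utf-8")
            i += width


def count_spaces(s: Utf8String) -> int:
    # Works on decoded code points, so non-ASCII spaces are handled too.
    return sum(1 for ch in s if ch.isspace())
```

    Nothing here needs s[i]; the variable width is only visible inside
    the iterator.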

    You could hide combining characters, which would be extremely useful if
    we were just using Latin and Cyrillic scripts. You'd have to be flexible,
    though, since it would be natural to step through a Hebrew or Arabic string
    as if the vowels were written inline, and people might occasionally want to
    look at the combining characters themselves (which would be incredibly rare
    if your language already provided most standard Unicode functions).
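
    To make the two views concrete, here is a minimal sketch in Python.
    The names clusters and bases are made up for this example. It groups
    each base character with the combining marks that follow it, using
    the canonical combining class from the standard unicodedata module.
    That is a simplification: it is not full grapheme cluster
    segmentation, and marks with combining class 0 are treated as base
    characters.

```python
import unicodedata


def clusters(text: str):
    """Yield each base character together with the combining marks after it."""
    current = ""
    for ch in text:
        if current and unicodedata.combining(ch) != 0:
            # Nonzero combining class: attach the mark to the current base.
            current += ch
        else:
            if current:
                yield current
            current = ch
    if current:
        yield current


def bases(text: str) -> str:
    """View with the combining characters hidden: first character of each cluster."""
    return "".join(c[0] for c in clusters(text))


# Hebrew "shalom" with vowel points, written with escapes to keep the marks visible.
word = "\u05E9\u05B8\u05C1\u05DC\u05D5\u05B9\u05DD"
hidden = bases(word)                        # consonants only
inline = [ch for c in clusters(word) for ch in c]  # every code point, marks inline
```

    Iterating over clusters(word) hides the marks inside each step, while
    stepping through the code points themselves shows the vowels inline;
    a language could offer both iterators over the same string.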

    This archive was generated by hypermail 2.1.5 : Mon Dec 06 2004 - 20:06:21 CST