Re: Nicest UTF

From: Marcin 'Qrczak' Kowalczyk
Date: Sun Dec 05 2004 - 10:20:12 CST


    "Philippe Verdy" <> writes:

    >> The point is that indexing should better be O(1).
    > SCSU is also O(1) in terms of indexing complexity...

    It is not. You can't extract the nth code point without scanning the
    previous n-1 code points.
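
    To make the scanning cost concrete, here is a minimal sketch. It uses
    UTF-8 rather than SCSU as the variable-width example, because a UTF-8
    scanner fits in a few lines; SCSU would additionally have to track its
    window and mode state while scanning. The function name
    nth_code_point_utf8 is made up for this illustration.

```python
def nth_code_point_utf8(data: bytes, n: int) -> str:
    """Return the n-th code point (0-based) of UTF-8 data by linear scan."""
    if n < 0:
        raise IndexError("code point index out of range")
    index = -1      # index of the code point whose lead byte was seen last
    start = None    # byte offset where the n-th code point begins
    for pos, byte in enumerate(data):
        # Bytes of the form 10xxxxxx are continuation bytes; any other
        # byte starts a new code point.
        if byte & 0xC0 != 0x80:
            index += 1
            if index == n:
                start = pos
            elif index == n + 1:
                return data[start:pos].decode("utf-8")
    if start is None:
        raise IndexError("code point index out of range")
    return data[start:].decode("utf-8")


text = "na\u00efve \u65e5\u672c\u8a9e"
encoded = text.encode("utf-8")

# CPython stores each str with a fixed width per code point, so this
# lookup is a plain array access.
assert text[7] == "\u672c"
# In the variable-width form, every earlier lead byte has to be visited.
assert nth_code_point_utf8(encoded, 7) == "\u672c"
```

    The scan is O(n) in the index, which is the cost being objected to.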

    > But individual characters do not always have any semantic. For
    > languages, the relevant unit is almost always the grapheme cluster,
    > not the character (so not its code point...).

    How do you determine the semantics of a grapheme cluster? Answer: by
    splitting it into code points. A code point is atomic: it is not split
    any further, because there is a finite number of them.
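
    As a small illustration of that splitting, the sketch below takes one
    grapheme cluster ("e" followed by a combining acute accent) and looks
    up the properties of each code point in it. The helper name describe
    is hypothetical; unicodedata is Python's standard library module.

```python
import unicodedata


def describe(cluster: str) -> list:
    """List (code point, character name, combining class) for each code point."""
    return [
        (f"U+{ord(ch):04X}", unicodedata.name(ch), unicodedata.combining(ch))
        for ch in cluster
    ]


# One user-perceived character made of two code points.
for entry in describe("e\u0301"):
    print(entry)
```

    The properties are defined per code point; whatever can be said about
    the cluster is derived from them.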

    When a string is exchanged with another application, a networked
    computer, or the OS, it always uses some encoding which is closer to
    code points than to grapheme clusters, whether that is UTF-8,
    UTF-16, or ISO-8859-something. If the string were originally stored as
    an array of grapheme clusters, it would have to be translated to code
    points before further conversion.
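
    A short sketch of what that means in practice, using Python's built-in
    codecs: each encoding consumes the string one code point at a time,
    and a cluster is simply whatever its code points encode to.

```python
cluster = "e\u0301"      # one grapheme cluster, two code points
precomposed = "\u00e9"   # the same letter as a single code point

# UTF-8 and UTF-16 encode the cluster code point by code point.
utf8_bytes = cluster.encode("utf-8")
utf16_bytes = cluster.encode("utf-16-le")

# ISO-8859-1 has a byte for U+00E9 but none for U+0301, so only the
# precomposed form can be encoded.
latin1_bytes = precomposed.encode("iso-8859-1")
try:
    cluster.encode("iso-8859-1")
    latin1_accepts_cluster = True
except UnicodeEncodeError:
    latin1_accepts_cluster = False
```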

    > Which represent will be the best is left to implementers, but I really
    > think that compressed schemes are often introduced to increase the
    > application performances and reduce the needed resources both in
    > memory and for I/O, but also in networking where interoperability
    > across systems and bandwidth optimization are also important design
    > goals...

    UTF-8 is much better for interoperability than SCSU, because it's
    already widely supported and SCSU is not.

    It's also easier to add support for UTF-8 than for SCSU. UTF-8 is
    stateless and SCSU is stateful, which is very important. UTF-8 is also
    easier to encode and decode.
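
    To show how little is involved, here is a minimal sketch of a UTF-8
    encoder for a single code point, checked against Python's built-in
    codec. The name encode_utf8 is made up for this example. Note that the
    output depends only on the code point itself, never on what was
    encoded before it; an SCSU encoder has no equivalent of this, since it
    must carry its window and mode state from one character to the next.

```python
def encode_utf8(cp: int) -> bytes:
    """Encode one Unicode scalar value as UTF-8."""
    if cp < 0 or cp > 0x10FFFF or 0xD800 <= cp <= 0xDFFF:
        raise ValueError("not a Unicode scalar value")
    if cp < 0x80:          # 1 byte:  0xxxxxxx
        return bytes([cp])
    if cp < 0x800:         # 2 bytes: 110xxxxx 10xxxxxx
        return bytes([0xC0 | (cp >> 6), 0x80 | (cp & 0x3F)])
    if cp < 0x10000:       # 3 bytes: 1110xxxx 10xxxxxx 10xxxxxx
        return bytes([
            0xE0 | (cp >> 12),
            0x80 | ((cp >> 6) & 0x3F),
            0x80 | (cp & 0x3F),
        ])
    return bytes([         # 4 bytes: 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
        0xF0 | (cp >> 18),
        0x80 | ((cp >> 12) & 0x3F),
        0x80 | ((cp >> 6) & 0x3F),
        0x80 | (cp & 0x3F),
    ])


# Agrees with the built-in codec for one sample from each length class.
for cp in (0x41, 0xE9, 0x20AC, 0x1F600):
    assert encode_utf8(cp) == chr(cp).encode("utf-8")

# Stateless: decoding can start at any lead byte without knowing what
# came before it.
encoded = "A\u00e9\u20ac".encode("utf-8")
assert encoded[1:].decode("utf-8") == "\u00e9\u20ac"
```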

       __("<         Marcin Kowalczyk
