SCSU as internal encoding (was: Re: Nicest UTF)

From: Doug Ewell (
Date: Sun Dec 05 2004 - 21:40:42 CST

    Philippe Verdy <verdy underscore p at wanadoo dot fr> wrote:

    >> The point is that indexing should better be O(1).
    > SCSU is also O(1) in terms of indexing complexity... simply because it
    > keeps the exact equivalence with codepoints, and requires a *fixed*
    > (and small) number of steps to decode it to code points, but also
    > because the decoder state uses a *fixed* (and small) number of
    > variables for the internal context (unlike more powerful compression
    > algorithms like dictionary-based, Lempel-Ziv-Welch-like algorithms
    > such as deflate).

    As Marcin said, SCSU is O(n) in terms of indexing complexity, because
    you have to decode the first (n - 1) characters before you can decode
    the n'th. Even when you have a run of "ASCII" bytes between 0x20 and
    0x7E, there is no guarantee that the characters are Basic Latin. There
    might have been a previous SCU tag that switched into Unicode mode.
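
To make the point concrete, here is a minimal sketch of a decoder for a
tiny subset of SCSU. It is an illustration under stated assumptions, not a
conforming implementation: it handles only printable ASCII in single-byte
mode, the SCU tag (0x0F), the UC0 tag (0xE0) and plain two-byte BMP units in
Unicode mode, and rejects everything else. The name `decode_subset` is
hypothetical.

```python
SCU = 0x0F  # single-byte mode tag: change to Unicode mode
UC0 = 0xE0  # Unicode mode tag: change back to single-byte mode, window 0


def decode_subset(data: bytes) -> str:
    """Decode a small SCSU subset, walking the bytes from the start."""
    out = []
    unicode_mode = False  # SCSU streams start in single-byte mode
    i = 0
    while i < len(data):
        b = data[i]
        if not unicode_mode:
            if b == SCU:
                unicode_mode = True
                i += 1
            elif 0x20 <= b <= 0x7E:
                out.append(chr(b))  # printable ASCII passes through
                i += 1
            else:
                raise ValueError(f"byte {b:#04x} not handled by this sketch")
        else:
            if b == UC0:
                unicode_mode = False
                i += 1
            elif b < 0xE0 and i + 1 < len(data):
                unit = (b << 8) | data[i + 1]  # big-endian 16-bit unit
                if 0xD800 <= unit <= 0xDFFF:
                    raise ValueError("surrogates not handled by this sketch")
                out.append(chr(unit))
                i += 2
            else:
                raise ValueError(f"byte {b:#04x} not handled by this sketch")
    return "".join(out)


# The same two bytes, 0x4E 0x2D, mean different things depending on
# whether an SCU tag appeared earlier in the stream.
plain = decode_subset(b"N-")
after_scu = decode_subset(b"\x0fN-")
```

Here `plain` is the two characters "N-", while `after_scu` is the single
character U+4E2D. Nothing in the bytes 0x4E 0x2D themselves says which
reading applies, so finding the n'th character means scanning from the
beginning.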

    >> No, individual characters are immutable in almost every language.
    > But individual characters do not always have any semantics. For
    > languages, the relevant unit is almost always the grapheme cluster,
    > not the character (so not its code point...). As grapheme clusters
    > need to be represented on variable lengths, an algorithm that could
    > only work with fixed-width units would not work internationally or
    > would cause serious problems for correct analysis or transformation of
    > true languages.

    This is beside the point, as I said at the outset. In programming, you
    have to deal with individual characters in a string on a regular basis,
    even if some characters depend on others from a linguistic standpoint.

    > Code points are probably the easiest thing to describe what a text
    > algorithm is supposed to do, but this is not a requirement for
    > applications (in fact many libraries have been written that correctly
    > implement the Unicode algorithms, without even dealing with code
    > points, but only with in-memory code units of UTF-16 or even in UTF-8
    > or GB18030, or directly with serialization bytes of UTF-16LE or UTF-8
    > or SCSU or other encoding schemes).

    Algorithms that operate on CES-specific code units are what lead to such
    "wonderful" innovations as CESU-8. All text operations, except for
    encoding and decoding, should work with code points.

    Marcin <qrczak at knm dot org dot pl> responded:

    > UTF-8 is much better for interoperability than SCSU, because it's
    > already widely supported and SCSU is not.

    True, but not really Philippe's point.

    Philippe again:

    > The question is why you would need to extract the nth codepoint so
    > blindly. If you have such reasons, because you know the context in
    > which this index is valid and usable, then you can as well extract a
    > sequence using an index in the SCSU encoding itself using the same
    > knowledge.
    > Linguistically, extracting a substring or characters at any random
    > index in a sequence of code points will only cause you problems. In
    > general, you will more likely use index as a way to mark a known
    > position that you have already parsed sequentially in the past.

    You have to do this ALL THE TIME in programming.

    Example: searching and replacing text. To search a string for a
    substring, you would normally write a function that would not only give
    a yes/no answer (i.e. "this string does/does not contain the
    substring"), but would also indicate *where* the substring was found
    within the string. That's because the world needs not only search
    tools, but also search-and-replace tools, and you need to know where the
    substring is in order to replace it with another. "Linguistically" has
    nothing to do with it. Nothing prevents the user of a
    search-and-replace tool from doing something linguistically unsound, nor
    should it.
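
As a rough sketch of what I mean, here is a search-and-replace helper that
reports where the match was found. It relies on Python strings being indexed
by code point; the function name `replace_first` is made up for this
example.

```python
def replace_first(text: str, needle: str, replacement: str):
    """Replace the first occurrence of needle; return (new_text, index).

    The index is a code point offset into text, or -1 if needle is absent.
    """
    pos = text.find(needle)
    if pos == -1:
        return text, -1
    # Slice around the match using the code point index that find() returned.
    new_text = text[:pos] + replacement + text[pos + len(needle):]
    return new_text, pos


# U+1D11E (MUSICAL SYMBOL G CLEF) is outside the BMP but still counts as
# one position, so the index of "here" is 9.
result, where = replace_first("G clef \U0001D11E here", "here", "there")
```

The caller gets both the new string and the position, and the position is
usable directly as an index without re-decoding anything.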

    If you do this in SCSU, you have to keep track of the state of the
    decoder within the string (single-byte vs. Unicode mode, current dynamic
    window, and position of all dynamic windows). If you lose track of the
    decoder state, you run the risk of corrupting the data. (Philippe
    acknowledged this in his next paragraph.) You really need to convert
    internally to code points in order to do this. I'm a believer in SCSU
    as an efficient storage and transfer encoding, but not as an internal
    process code.
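
A sketch of the state that would have to travel with every index into an
SCSU string might look like this. The class and function names are
hypothetical, the window offsets are the default dynamic window positions as
I understand them from UTS #6, and only the single-byte-mode mapping of
bytes 0x80..0xFF is shown.

```python
from dataclasses import dataclass, field

# Default dynamic window offsets (assumed from UTS #6): Latin-1 Supplement,
# Latin-1/Latin Extended-A, Cyrillic, Arabic, Devanagari, Hiragana,
# Katakana, Fullwidth ASCII.
DEFAULT_WINDOWS = (0x0080, 0x00C0, 0x0400, 0x0600,
                   0x0900, 0x3040, 0x30A0, 0xFF00)


@dataclass
class ScsuState:
    unicode_mode: bool = False   # single-byte vs. Unicode mode
    active_window: int = 0       # current dynamic window
    windows: list = field(default_factory=lambda: list(DEFAULT_WINDOWS))


def decode_high_byte(state: ScsuState, b: int) -> str:
    """Map a byte 0x80..0xFF through the active dynamic window."""
    if state.unicode_mode or not 0x80 <= b <= 0xFF:
        raise ValueError("only single-byte-mode high bytes are handled here")
    return chr(state.windows[state.active_window] + (b - 0x80))


# The same byte yields different characters under different states.
latin = decode_high_byte(ScsuState(), 0xC0)                    # U+00C0
cyrillic = decode_high_byte(ScsuState(active_window=2), 0xC0)  # U+0440
```

An index into the byte stream is meaningless without the matching
`ScsuState`, which is exactly the bookkeeping that goes away once the text
is held as code points.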

    > None of those are demonstrations: decoding IRC commands or similar
    > things does not constitute the need to encode large sets of texts. In
    > your examples, you show applications that need to handle locally some
    > strings made for computer languages.

    One of the main stated goals of SCSU was to provide good compression for
    small strings.

    > Texts of human languages, or even a collection of person names, or
    > places are not like this, and have a much wider variety, but with huge
    > possibilities for data compression (inherent to the phonology of human
    > languages and their overall structure, but also due to repetitive
    > conventions spread throughout the text to allow easier reading and
    > understanding).

    This is where general-purpose compression schemes excel, and should be
    considered. (You might want to read UTN #14 after all.)
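
To illustrate the difference between small strings and large texts, here is
a quick sketch using zlib (deflate) from the Python standard library as a
stand-in for "general-purpose compression". The sample strings and the
helper name `sizes` are made up, and it says nothing about SCSU's own output
sizes.

```python
import zlib


def sizes(text: str):
    """Return (UTF-8 size, zlib-compressed size) in bytes."""
    raw = text.encode("utf-8")
    return len(raw), len(zlib.compress(raw))


# A very short string: the zlib header and checksum alone outweigh it.
short_raw, short_packed = sizes("Doug")

# A long, repetitive text: deflate has plenty of redundancy to exploit.
long_raw, long_packed = sizes("the quick brown fox " * 500)
```

The short string comes out larger after compression, while the long
repetitive one shrinks considerably, which is why the two kinds of scheme
serve different needs.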

    > My conclusion: there's no "best" representation to fit all needs. Each
    > representation has its merits in its domain. The Unicode UTFs are
    > excellent only for local processing of limited texts, but they are not
    > necessarily the best for long term storage or for large text sets.

    I agree completely.

    -Doug Ewell
     Fullerton, California

    This archive was generated by hypermail 2.1.5 : Sun Dec 05 2004 - 21:43:12 CST