Re: Nicest UTF

From: Marcin 'Qrczak' Kowalczyk (qrczak@knm.org.pl)
Date: Sun Dec 05 2004 - 14:28:22 CST


    "Philippe Verdy" <verdy_p@wanadoo.fr> writes:

    > The question is why you would need to extract the nth codepoint so
    > blindly.

    For example I'm scanning a string backwards (to remove '\n' at the
    end, to find and display the last N lines of a buffer, to find the
    last '/' or last '.' in a file name). SCSU in general supports
    only forward traversal.
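    With a fixed-width byte encoding such as ISO-8859-1 the backward
    scan is a trivial index loop. A minimal sketch in C (find_last_byte
    is my name for a hypothetical helper, not any library function):

    ```c
    #include <stddef.h>

    /* Return the index of the last occurrence of c in s[0..len),
     * or -1 if absent.  One byte per character is assumed, so
     * stepping backwards is just decrementing an index -- exactly
     * what SCSU cannot offer. */
    static long find_last_byte(const char *s, size_t len, char c)
    {
        for (size_t i = len; i > 0; i--)
            if (s[i - 1] == c)
                return (long)(i - 1);
        return -1;
    }
    ```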

    > But remember the context in which this discussion was introduced:
    > which UTF would be the best to represent (and store) large sets of
    > immutable strings. The discussion about indexes in substrings is not
    > relevant in that context.

    It is relevant. A general purpose string representation should support
    at least a bidirectional iterator, or preferably efficient random access.
    Neither is possible with SCSU.

    * * *

    Now consider scanning forwards. We want to strip a beginning of a
    string. For example the string is an irc message prefixed with a
    command and we want to take the message only for further processing.
    We have found the end of the prefix and we want to produce a string
    from this position to the end (a copy, since strings are immutable).

    With any stateless encoding a suitable library function will compute
    the length of the result, allocate memory, and do an equivalent of
    memcpy.
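    For a stateless encoding that library function can be sketched in a
    few lines of C (substring_from is my name for an illustrative
    helper, not a real API):

    ```c
    #include <stdlib.h>
    #include <string.h>

    /* Copy the suffix of s starting at byte offset start.  With a
     * stateless encoding the physical size of the result is known up
     * front, so the whole operation is an allocation plus one memcpy. */
    static char *substring_from(const char *s, size_t len, size_t start)
    {
        size_t n = len - start;      /* result size known in advance */
        char *copy = malloc(n + 1);
        if (!copy)
            return NULL;
        memcpy(copy, s + start, n);
        copy[n] = '\0';
        return copy;
    }
    ```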

    With SCSU it's not possible to copy the string without analysing it
    because the prefix might have changed the state, so the suffix is not
    correct when treated as a standalone string. If the stripped part is
    short and the remaining part is long, it might pay off to scan the
    part we want to strip and fall back to a plain memcpy if the prefix
    did not change the state (which is probably the common case). But in
    general we must recompress the whole copied part! We can't even
    precalculate its physical size. Decompressing into temporary memory
    would negate the benefits of a compressed encoding, so it is better
    to decompress and recompress in parallel into a dynamically resized
    buffer. This is ridiculously complex compared to a memcpy.
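    The memcpy shortcut above can be sketched as a conservative state
    check. This is my own sketch, not part of any SCSU library; it
    assumes the single-byte-mode tag assignments of UTR #6, where only
    0x00, 0x09, 0x0A, 0x0D and 0x20-0xFF pass through without touching
    the window state, and every other byte below 0x20 is a tag:

    ```c
    #include <stdbool.h>
    #include <stddef.h>

    /* Conservative check (hypothetical helper): true only when every
     * byte of the SCSU prefix is a pass-through byte in the initial
     * single-byte mode, so no tag can have altered the state and the
     * suffix can be memcpy'd as a standalone SCSU string.  Any tag
     * byte returns false, forcing the slow recompression path. */
    static bool prefix_is_state_neutral(const unsigned char *p, size_t len)
    {
        for (size_t i = 0; i < len; i++) {
            unsigned char b = p[i];
            if (b >= 0x20)          /* ASCII and window bytes */
                continue;
            if (b == 0x00 || b == 0x09 || b == 0x0A || b == 0x0D)
                continue;           /* pass-through controls */
            return false;           /* SCSU tag: state may change */
        }
        return true;
    }
    ```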

    The *only* advantage of SCSU is that it takes little space. But in
    most programs most strings are ASCII, and SCSU never beats
    ISO-8859-1, which is what the implementation of my language uses
    for strings with no characters above U+00FF, so usually it does
    not have even this advantage.

    Disadvantages are everywhere else: every operation which looks at the
    contents of a string or produces contents of a string is more complex.
    Some operations can't be supported at all with the same asymptotic
    complexity, so the API would have to be changed as well to use opaque
    iterators instead of indices. It's more complicated both for internal
    processing and for interoperability (unless the other end understands
    SCSU too, which is unlikely).
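    The asymptotic change is easy to see with an opaque-iterator
    sketch. UTF-8 is used here as a simpler variable-width stand-in
    for SCSU, and str_iter, utf8_next and nth_codepoint are my names,
    not an existing API; valid input is assumed and no error handling
    is done:

    ```c
    #include <stddef.h>

    typedef struct { const unsigned char *p; } str_iter;

    /* Decode the code point at the iterator and advance it.
     * Assumes well-formed UTF-8. */
    static unsigned utf8_next(str_iter *it)
    {
        unsigned char b = it->p[0];
        if (b < 0x80) { it->p += 1; return b; }
        if (b < 0xE0) {
            unsigned cp = (unsigned)(b & 0x1F) << 6 | (it->p[1] & 0x3F);
            it->p += 2; return cp;
        }
        if (b < 0xF0) {
            unsigned cp = (unsigned)(b & 0x0F) << 12
                        | (unsigned)(it->p[1] & 0x3F) << 6
                        | (it->p[2] & 0x3F);
            it->p += 3; return cp;
        }
        unsigned cp = (unsigned)(b & 0x07) << 18
                    | (unsigned)(it->p[1] & 0x3F) << 12
                    | (unsigned)(it->p[2] & 0x3F) << 6
                    | (it->p[3] & 0x3F);
        it->p += 4; return cp;
    }

    /* What used to be s[n] becomes n+1 iterator steps:
     * O(n) instead of O(1). */
    static unsigned nth_codepoint(const char *s, size_t n)
    {
        str_iter it = { (const unsigned char *)s };
        unsigned cp = 0;
        for (size_t i = 0; i <= n; i++)
            cp = utf8_next(&it);
        return cp;
    }
    ```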

    Plain immutable character arrays are not completely universal either
    (e.g. they are not sufficient for a buffer of a text editor), but they
    are appropriate as the default representation for common cases; for
    representing filenames, URLs, email addresses, computer language
    identifiers, command line option names, lines of a text file, messages
    in a dialog in a GUI, names of columns of a database table etc. Most
    strings are short, so performing a physical copy when extracting
    a substring is not disastrous. But the complexity of SCSU is too
    high a price.

    -- 
       __("<         Marcin Kowalczyk
       \__/       qrczak@knm.org.pl
        ^^     http://qrnik.knm.org.pl/~qrczak/
    


    This archive was generated by hypermail 2.1.5 : Sun Dec 05 2004 - 14:35:32 CST