Re: Nicest UTF

From: Philippe Verdy (verdy_p@wanadoo.fr)
Date: Fri Dec 03 2004 - 07:28:01 CST


    From: "Doug Ewell" <dewell@adelphia.net>
    > I appreciate Philippe's support of SCSU, but I don't think *even I*
    > would recommend it as an internal storage format. The effort to encode
    > and decode it, while by no means Herculean as often perceived, is not
    > trivial once you step outside Latin-1.

    I said: "for immutable strings", which means that these Strings are
    instantiated for the long term and reused many times. In that sense, what
    really matters is the cost of decoding, not the effort to encode (which is
    minimal for ISO-8859-1 encoded source texts, or for Unicode UTF-encoded
    texts that only use characters from the first 256 code points).

    Decoding SCSU is very straightforward, even though it is stateful (at the
    level of individual characters). But for immutable strings, there's no
    need to handle various initial states, and the state associated with each
    component character of the string has no importance (strings being
    immutable, only the decoding of the string as a whole makes sense).
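
    To make this concrete, here is a minimal sketch of such a decoder in
    Java. It only covers the single-byte mode with the default windows, the
    SCn window changes and the SQn/SQU quotes; window redefinition (SDn, SDX)
    and Unicode mode (SCU) are omitted, so this shows the shape of the work
    rather than the full UTS #6 algorithm.

        public final class ScsuDecoder {
            // Default dynamic window offsets (initial state per UTS #6).
            private static final int[] DYNAMIC_DEFAULTS = {
                0x0080, 0x00C0, 0x0400, 0x0600, 0x0900, 0x3040, 0x30A0, 0xFF00
            };
            // Static window offsets (fixed by UTS #6).
            private static final int[] STATIC = {
                0x0000, 0x0080, 0x0100, 0x0300, 0x2000, 0x2080, 0x2100, 0x3000
            };

            public static String decode(byte[] in) {
                StringBuilder out = new StringBuilder(in.length);
                int[] dynamic = DYNAMIC_DEFAULTS.clone(); // decoder state:
                int active = 0;                           // 8 offsets + 1 index
                for (int i = 0; i < in.length; i++) {
                    int b = in[i] & 0xFF;
                    if ((b >= 0x20 && b <= 0x7F)
                            || b == 0x00 || b == 0x09 || b == 0x0A || b == 0x0D) {
                        out.append((char) b);             // US-ASCII pass-through
                    } else if (b >= 0x80) {               // active dynamic window
                        out.append((char) (dynamic[active] + b - 0x80));
                    } else if (b >= 0x10 && b <= 0x17) {  // SCn: switch window
                        active = b - 0x10;
                    } else if (b >= 0x01 && b <= 0x08) {  // SQn: quote one char
                        int n = b - 0x01, c = in[++i] & 0xFF;
                        out.append((char) (c >= 0x80 ? dynamic[n] + c - 0x80
                                                     : STATIC[n] + c));
                    } else if (b == 0x0E) {               // SQU: quote one
                        int hi = in[++i] & 0xFF;          // UTF-16BE code unit
                        int lo = in[++i] & 0xFF;
                        out.append((char) ((hi << 8) | lo));
                    } else {                              // SDn, SDX, SCU... are
                        throw new IllegalArgumentException( // not handled here
                            "unsupported SCSU tag: " + b);
                    }
                }
                return out.toString();
            }
        }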

    The stateful decoding of SCSU can be hidden behind an accessor of a
    storage class, which can also easily be optimized to avoid multiple
    reallocations of the decoded buffer.
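
    For instance (the class name CompressedString and the SoftReference cache
    are my own assumptions here, not a standard API), the accessor can decode
    in a single pass into one buffer sized upfront, and cache the result:

        import java.lang.ref.SoftReference;

        // Immutable string stored in SCSU form; decoding happens on demand
        // in the accessor, and the decoded form is cached behind a
        // SoftReference so the VM may reclaim it under memory pressure.
        public final class CompressedString {
            private final byte[] scsu;           // long-term compressed storage
            private SoftReference<String> cache; // decoded form, rebuilt as needed

            public CompressedString(byte[] scsu) {
                this.scsu = scsu.clone();
            }

            public String get() {
                String s = (cache == null) ? null : cache.get();
                if (s == null) {
                    s = ScsuDecoder.decode(scsu); // one pass, one allocation
                    cache = new SoftReference<>(s);
                }
                return s;
            }
        }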

    SCSU is only a complication if you want mutable strings; however, mutable
    strings are needed only if you intend to transform a source text and work
    on its content. If mutability is just a temporary need while creating
    other immutable strings, you can still use SCSU to encode the final
    results, and work with UTFs for the intermediate ones.

    In a text editor, where you'll constantly need to work at the character
    level, the text is not immutable, and SCSU is effectively not a good
    encoding for working on it (but all UTFs, including UTF-8 or GB18030, are
    easy to work with at this level).

    In practice, a text editor often needs to split the edited text into
    manageable fragments encoded separately, for performance reasons (text
    insertion and deletion in a single large buffer is a lengthy and costly
    operation). Given that UTFs can increase the memory needed, it is not
    completely stupid to think about using a compression scheme for individual
    fragments of a large text file; the cost of encoding/decoding SCSU can be
    an interesting optimization if it limits the number of VM swaps to disk
    needed to access more fragments, as the total size on disk will be
    smaller, reducing the number of I/O operations and so improving the
    program's responsiveness to user commands.
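
    One possible shape for that idea, reusing the classes sketched above
    (FragmentedText is a hypothetical name; a real editor would rather use a
    piece table or a rope, and a real SCSU encoder instead of the
    Latin-1-only shortcut below):

        import java.nio.charset.StandardCharsets;
        import java.util.ArrayList;
        import java.util.List;

        // Only the fragment being edited is kept decoded and mutable; all
        // other fragments stay SCSU-compressed until they are touched.
        final class FragmentedText {
            private final List<CompressedString> fragments = new ArrayList<>();
            private StringBuilder hot; // the decoded, editable fragment
            private int hotIndex = -1;

            void openForEdit(int index) {
                if (hotIndex >= 0) {   // recompress the fragment we leave
                    fragments.set(hotIndex,
                        new CompressedString(encode(hot.toString())));
                }
                hot = new StringBuilder(fragments.get(index).get());
                hotIndex = index;
            }

            // In the default SCSU state, Latin-1 bytes encode themselves, so
            // for ISO-8859-1 text (without C0 controls other than NUL, tab,
            // CR and LF) the ISO-8859-1 bytes are already valid SCSU.
            private static byte[] encode(String s) {
                return s.getBytes(StandardCharsets.ISO_8859_1);
            }
        }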

    (Note that there already exist applications of such compression schemes,
    even within filesystems that support editable but still compressed
    files... SCSU is not the option used in that case, because it is too
    specific to Unicode texts; they use much more complex compression schemes,
    most often derived from the Lempel-Ziv-Welch family of algorithms. This
    does not significantly increase the total load time, given that it also
    significantly reduces the frequency of disk I/O, which is a much longer
    and costlier operation...)

    The bad thing about SCSU is that the compression scheme is not
    deterministic: you can't easily compare two instances of strings encoded
    with SCSU (because several alternative encodings are possible) without
    actually decoding them prior to performing their collation (with standard
    UTFs, including the Chinese GB18030 standard, the encoding is
    deterministic and allows comparing encoded strings without first decoding
    them).
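
    A concrete example: the single character U+00E9 (é) has at least three
    valid SCSU encodings, all of which the sketch decoder above maps to the
    same string even though the byte sequences differ:

        public class ScsuAmbiguityDemo {
            public static void main(String[] args) {
                byte[] a = { (byte) 0xE9 };             // active window 0 (0x0080)
                byte[] b = { 0x01, (byte) 0xE9 };       // SQ0: quote from window 0
                byte[] c = { 0x0E, 0x00, (byte) 0xE9 }; // SQU: UTF-16BE code unit
                // All three print "é": a byte-wise comparison of the encoded
                // forms would wrongly report the strings as different.
                System.out.println(ScsuDecoder.decode(a));
                System.out.println(ScsuDecoder.decode(b));
                System.out.println(ScsuDecoder.decode(c));
            }
        }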

    But this argument also holds for almost all compression schemes, even for
    the well-known "deflate" algorithm, for very basic compressors like RLE,
    or for the newer "bzip2" compression (the output depends on the compressor
    implementation used, on tunable parameters, and on the number of
    alternatives and the size of the internal dictionaries considered during
    compression).

    The advantage of SCSU over generic data compressors like "deflate" is
    that it does not require a large and complex state (all the SCSU decoding
    state is managed with a very limited number of fixed-size variables), so
    its decompression can easily be hardcoded and heavily optimized, up to a
    point where the cost of decompression will be nearly invisible to almost
    all applications: the most significant costs will most often be within
    collators or text parsers; a compliant UCA collation algorithm is much
    more complex to implement and optimize than an SCSU decompressor, and it
    is more CPU- and resource-intensive.
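
    Indeed, the complete decoder state fits in a handful of fields. This is
    just the state from the sketches above written out as a class (compare it
    with deflate, which needs a sliding window of up to 32 KB plus per-block
    Huffman tables):

        // The entire SCSU decoder state: one mode flag, one active-window
        // index, and eight redefinable window offsets.
        final class ScsuState {
            boolean unicodeMode;                     // single-byte vs. Unicode mode
            int activeWindow;                        // 0..7
            final int[] dynamicOffsets = new int[8]; // redefinable via SDn/SDX
        }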


