From: Philippe Verdy (verdy_p@wanadoo.fr)
Date: Fri Dec 03 2004 - 07:28:01 CST
From: "Doug Ewell" <dewell@adelphia.net>
> I appreciate Philippe's support of SCSU, but I don't think *even I*
> would recommend it as an internal storage format. The effort to encode
> and decode it, while by no means Herculean as often perceived, is not
> trivial once you step outside Latin-1.
I said "for immutable strings", which means that these strings are
instantiated for the long term and reused many times. In that sense, what
really matters is the cost of decoding, not the effort of encoding (which is
minimal for ISO-8859-1 encoded source texts, or for UTF-encoded Unicode
texts that only use characters from the first page).
Decoding SCSU is very straightforward, even though it is stateful (at the
internal character level). But for immutable strings, there's no need to
handle various initial states, and the state associated with each component
character of the string has no importance (since strings are immutable, only
the decoding of the string as a whole makes sense).
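To illustrate how small that state is, here is a minimal sketch of a
single-byte-mode SCSU decoder (per UTS #6). It handles ASCII pass-through,
the active dynamic window, window selection (SC0-SC7), single-character
quotes (SQ0-SQ7) and the Unicode quote tag (SQU); window-definition tags and
Unicode mode are omitted for brevity, so this is not a complete decoder:

```python
# Static and default dynamic window offsets from UTS #6.
STATIC = [0x0000, 0x0080, 0x0100, 0x0300, 0x2000, 0x2080, 0x2100, 0x3000]
DYNAMIC_DEFAULTS = [0x0080, 0x00C0, 0x0400, 0x0600,
                    0x0900, 0x3040, 0x30A0, 0xFF00]

def decode_scsu(data: bytes) -> str:
    windows = list(DYNAMIC_DEFAULTS)   # the entire decoder state fits in
    active = 0                         # a few fixed-size variables
    out = []
    i = 0
    while i < len(data):
        b = data[i]; i += 1
        if 0x20 <= b <= 0x7F or b in (0x00, 0x09, 0x0A, 0x0D):
            out.append(chr(b))                           # ASCII pass-through
        elif b >= 0x80:
            out.append(chr(b - 0x80 + windows[active]))  # active window
        elif 0x10 <= b <= 0x17:
            active = b - 0x10                            # SCn: select window
        elif 0x01 <= b <= 0x08:
            q = data[i]; i += 1                          # SQn: quote one char
            n = b - 0x01
            if q < 0x80:
                out.append(chr(q + STATIC[n]))
            else:
                out.append(chr(q - 0x80 + windows[n]))
        elif b == 0x0E:
            out.append(chr(data[i] << 8 | data[i + 1]))  # SQU: BMP quote
            i += 2
        else:
            raise ValueError(f"tag 0x{b:02X} not handled in this sketch")
    return "".join(out)
```

Note that Latin-1 text is its own SCSU encoding under the default window, so
`decode_scsu(b"caf\xe9")` yields "café", and selecting the Cyrillic window
with SC2 (0x12) decodes Russian text one byte per character.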
The stateful decoding of SCSU can be part of an accessor from a storage
class, which can also be optimized easily to avoid multiple reallocations of
the decoded buffer.
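Such an accessor could look like the following sketch, where zlib-compressed
UTF-8 merely stands in for an SCSU codec (the class name and layout are
illustrative, not from any library): the string is decoded once, in whole,
and the result is cached so later accesses pay nothing.

```python
import zlib

class CompressedString:
    """Immutable string stored compressed; decoded lazily, once."""
    __slots__ = ("_packed", "_cache")

    def __init__(self, text: str):
        # zlib over UTF-8 is a stand-in for SCSU here.
        self._packed = zlib.compress(text.encode("utf-8"))
        self._cache = None              # decoded form, filled on first access

    def get(self) -> str:
        if self._cache is None:         # decode the whole string exactly once
            self._cache = zlib.decompress(self._packed).decode("utf-8")
        return self._cache
```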
SCSU is a complication only if you want mutable strings; however, mutable
strings are needed only if you intend to transform a source text and work on
its content. If that need is temporary, to create other immutable strings,
you can still use SCSU to encode the final results and work with UTFs for
intermediate results.
In a text editor, where you'll constantly need to work at the character
level, the text is not immutable, and SCSU is effectively not a good
encoding for working on it (though all UTFs, including UTF-8 and GB18030,
are easy to work with at this level).
In practice, a text editor often needs to split the edited text into
manageable fragments encoded separately, for performance reasons (since text
insertion and deletion in a large buffer is a lengthy and costly operation).
Given that UTFs can increase memory needs, it is not completely unreasonable
to consider a compression scheme for individual fragments of the large text
file: the cost of encoding/decoding SCSU can be a worthwhile optimization if
it limits the number of VM swaps to disk needed to access more fragments, as
the total size on disk will be smaller, reducing the number of I/O
operations and so improving the program's responsiveness to user commands.
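The fragment idea can be sketched as below; again zlib over UTF-8 stands in
for a per-fragment SCSU codec, and the class is a hypothetical illustration,
not a real editor data structure. Only the fragment being edited is held
decoded; the rest stay compressed:

```python
import zlib

class FragmentBuffer:
    """Editable text split into fragments, each stored compressed."""

    def __init__(self, text: str, size: int = 4096):
        # Compress each fragment separately (zlib stands in for SCSU),
        # so an edit only recodes one small piece, not the whole text.
        self.frags = [zlib.compress(text[i:i + size].encode("utf-8"))
                      for i in range(0, len(text), size)]

    def edit(self, n: int, fn):
        # Decode one fragment, transform it, re-encode just that fragment.
        plain = zlib.decompress(self.frags[n]).decode("utf-8")
        self.frags[n] = zlib.compress(fn(plain).encode("utf-8"))

    def text(self) -> str:
        return "".join(zlib.decompress(f).decode("utf-8")
                       for f in self.frags)
```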
(Note that applications of such compression schemes already exist, even
within filesystems that support editable but still compressed files... SCSU
is not the option used in that case, because it is too specific to Unicode
texts; they use much more complex compression schemes, most often derived
from Lempel-Ziv-Welch algorithms, and this does not significantly increase
the total load time, given that it also significantly reduces the frequency
of disk I/O, which is a much longer and costlier operation...)
The bad thing about SCSU is that its compression scheme is not
deterministic: you can't easily compare two instances of strings encoded
with SCSU (because several alternative encodings are possible) without
actually decoding them prior to performing their collation (with standard
UTFs, including the Chinese GB18030 standard, the encoding is deterministic
and allows comparing encoded strings without first decoding them).
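The contrast can be shown concretely. UTF-8 is deterministic, and its
byte-wise order happens to match code-point order, so encoded strings can be
compared directly; a general compressor (zlib used here as a stand-in for
any non-deterministic scheme) gives different byte streams for the same text
depending on its parameters:

```python
import zlib

# Deterministic encoding: byte comparison agrees with code-point order.
a, b = "cafe", "café"
assert (a < b) == (a.encode("utf-8") < b.encode("utf-8"))

# Non-deterministic alternatives: the same text compressed with two
# different settings yields different byte streams, yet both decompress
# back to the identical string.
raw = ("scsu " * 100).encode("utf-8")
fast, best = zlib.compress(raw, 1), zlib.compress(raw, 9)
assert fast != best
assert zlib.decompress(fast) == zlib.decompress(best) == raw
```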
But this argument also holds for almost all compression schemes, even for
the well-known "deflate" algorithm, for very basic compressors like RLE, or
for the newer "bzip2" compressor (depending on the implementation used, its
tunable parameters, and the number of alternatives and the size of the
internal dictionaries considered during compression).
The advantage of SCSU over generic data compressors like "deflate" is that
it does not require a large and complex state (all the SCSU decoding state
is managed with a very limited number of fixed-size variables), so its
decompression can easily be hardcoded and heavily optimized, to the point
where the cost of decompression is nearly invisible to almost all
applications: the most significant costs will more often lie in collators or
text parsers; a compliant UCA collation algorithm is much more complex to
implement and optimize than an SCSU decompressor, and it is more CPU- and
resource-intensive.
This archive was generated by hypermail 2.1.5 : Fri Dec 03 2004 - 15:44:57 CST