From: Philippe Verdy (verdy_p@wanadoo.fr)
Date: Sun Dec 05 2004 - 14:50:08 CST
Don't misinterpret my words or arguments here: the purpose of the question
was strictly about which UTF or other transformation would be good for
interoperability and storage, and whether it would be a good idea to encode
words with standard codes.
So in my view, it is completely unnecessary to create such "standard" codes
for common words, if these words belong to natural human language (it may
make sense for computer languages, but this is specific to the
implementation of such a language, and should be part of its specification
rather than being standardized in a general-purpose encoding like Unicode
code points, which are also made to fit all the needs for the representation
of human languages, which are NOT standardized and are constantly evolving.)
Creating such standard codes for human words would not only be an endless
task, but also work that would rapidly become obsolete and could not reflect
the highly variable usage of human languages. Let's keep Unicode simple
without attempting to encode words (even for Chinese, we encode
"ideographic" characters, but not words, which are often made of two
characters, each representing a single syllable).
If you want to encode words, you create an encoding based on a pictographic
representation of human languages, and you go against the direction that the
inventors of writing systems have followed over a very long history of
evolution. You would be returning to the earliest ages of humanity, when
people had great difficulty understanding each other and transmitting their
acquired knowledge.
This does not exclude using other UTF representations to implement
algorithms, but only as an intermediate form that eases the processing.
However, you are not required to create an actual instance of the other UTF
to work with it, and there are many examples where you can work perfectly
well with a compact representation that fits comfortably in memory with
excellent performance, and where the decompressed form is only used locally.
In *many* cases, notably if the text data managed this way is large, adding
an object representation with just an API to access a temporary decompressed
form will improve the overall performance of the system, due to reduced
internal processing resource needs. Code that decompresses SCSU to UTF-32
can fit in less than 1KB of memory, but it will allow saving as many
megabytes of memory as you wish for your large database, given that SCSU
will take an average of nearly one byte per character (or code point)
instead of 4 with UTF-32.
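
As a rough illustration of the "object with an API to a temporary
decompressed form" idea, here is a minimal Python sketch. It is not SCSU:
the Python standard library has no SCSU codec, so zlib over UTF-8 is used as
a stand-in, and the class name CompactText and its methods are hypothetical.
The sample text is highly repetitive, so the size difference it shows is
larger than real text would give.

```python
import zlib


class CompactText:
    """Holds text in a compressed form and decodes it only on request."""

    def __init__(self, text: str) -> None:
        # Stand-in for SCSU: zlib-compressed UTF-8 (an assumption of this sketch).
        self._blob = zlib.compress(text.encode("utf-8"))

    def stored_size(self) -> int:
        # Number of bytes actually kept in memory for the text.
        return len(self._blob)

    def text(self) -> str:
        # Temporary decompressed form: the caller uses it locally and drops it.
        return zlib.decompress(self._blob).decode("utf-8")


sample = "Unicode encodes characters, not words. " * 1000
compact = CompactText(sample)

# UTF-32 without a byte order mark takes exactly 4 bytes per code point.
utf32_size = len(sample.encode("utf-32-le"))

print("UTF-32 bytes: ", utf32_size)
print("compact bytes:", compact.stored_size())
```

Callers only ever see ordinary strings from text(); the compact form stays
an internal detail of the object.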
Such examples exist in real-world applications, notably in spelling and
grammatical correctors, whose performance depends completely on the total
size of the information they have in their database, and the level at which
this information is compressed (to minimize the impact on system resources,
which is mostly determined by the quantity of information you can fit into
fast memory without "swapping" between fast memory and slow disk storage).
The most efficient correctors use very compact forms with very specific
compression and indexing schemes through a transparent class managing the
conversion between this compact form and the usual representation of text as
a linear stream of characters.
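
One possible scheme of this kind, given only as an assumed example and not
as what any particular corrector uses, is front coding of a sorted word
list: each word is stored as the length of the prefix it shares with the
previous word plus the remaining suffix. The function names below are
hypothetical.

```python
def shared_prefix_length(a: str, b: str) -> int:
    """Count the leading characters that a and b have in common."""
    count = 0
    for x, y in zip(a, b):
        if x != y:
            break
        count += 1
    return count


def front_encode(sorted_words):
    """Store each word as (shared-prefix length, remaining suffix)."""
    encoded = []
    previous = ""
    for word in sorted_words:
        shared = shared_prefix_length(previous, word)
        encoded.append((shared, word[shared:]))
        previous = word
    return encoded


def front_decode(encoded):
    """Rebuild the usual list of plain words from the compact form."""
    words = []
    previous = ""
    for shared, suffix in encoded:
        word = previous[:shared] + suffix
        words.append(word)
        previous = word
    return words


words = sorted(["correct", "corrected", "correction", "corrector", "correctors"])
packed = front_encode(words)
print(packed)
```

For these five words only 15 suffix characters are stored instead of the 45
characters of the full words, and front_decode() restores the plain list.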
Other examples exist in some RDBMS to improve the speed of query
processing for large databases, or the speed of full-text searches, or in
their networking connectors to reduce the bandwidth taken by result sets.
The benefit of data compression becomes immediate as soon as the data to
process must go through any kind of channel (network links, file storage,
database tables) with lower throughput than fast but expensive or
restricted internal processing memory (including memory caches if we
consider data locality).
From: "D. Starner" <shalesller@writeme.com>
> "Philippe Verdy" writes:
>> Suppose that Unicode encodes the common English words "the", "an", "is",
>> etc... then a protocol could decide that these words are not important
>> and will filter them.
>
> Drop the part of the sentence before "then". A protocol could delete
> "the", "an", etc. right now. In fact, I suspect several library systems
> do drop "the", etc. right now. Not that this makes it a good idea, but
> that's a lousy argument.
If such a library does this based only on the presence of the encoded
words, without considering which language the text is written in, that kind
of text processing will be seriously inefficient or inaccurate for languages
other than English, the language for which such a library was built.
For plain text (which is what Unicode deals with), even the "an", "the",
"is" words (and so on...) are as important as other parts of the text.
Encoding frequent words with a single compact code may be effective for a
limited set of applications, but it will not be as effective as a more
general compression scheme (deflate, bzip2, and so on...) which will work
best independently of the language, and without needing (when implementing
text processing functions) an arbitrarily large dictionary for the
conversion of these compact codes to the associated plain-text words encoded
with streams of Unicode-supported characters.
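
The language-independence point can be checked with a short Python sketch
using the standard zlib (deflate) and bz2 modules. The two sample sentences
are made up for the illustration, and because they are repeated many times
the ratios are far better than ordinary text would give; the sketch only
shows that no per-language word dictionary is involved.

```python
import bz2
import zlib

# Hypothetical sample texts; any language works the same way.
samples = {
    "English": "The cat is on the table and the dog is in the garden. " * 200,
    "French": "Le chat est sur la table et le chien est dans le jardin. " * 200,
}

ratios = {}
for language, text in samples.items():
    raw = text.encode("utf-8")
    deflated = zlib.compress(raw)
    bzipped = bz2.compress(raw)

    # Round trip: the text comes back without any list of word codes.
    assert zlib.decompress(deflated) == raw
    assert bz2.decompress(bzipped) == raw

    ratios[language] = {
        "deflate": len(deflated) / len(raw),
        "bzip2": len(bzipped) / len(raw),
    }

for language, result in ratios.items():
    print(language, result)
```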
This archive was generated by hypermail 2.1.5 : Sun Dec 05 2004 - 14:56:22 CST