Re: Unicode for words?

From: Philippe Verdy (verdy_p@wanadoo.fr)
Date: Sun Dec 05 2004 - 14:50:08 CST


    Don't misinterpret my words or arguments here: the question was strictly
    about which UTF or other transformation would be good for interoperability
    and storage, and whether it would be a good idea to encode words with
    standard codes.

    So in my view, it is completely unnecessary to create such "standard" codes
    for common words when those words belong to a natural human language. (It
    may make sense for computer languages, but that is specific to the
    implementation of each such language and should be part of its
    specification, rather than being standardized in a general-purpose encoding
    like Unicode code points, which must also fit all the needs of human
    languages, which are NOT standardized and are constantly evolving.)
    Creating standard codes for human words would not only be an endless task,
    but also one that would rapidly become obsolete, because it could not
    follow the highly variable usage of human languages. Let's keep Unicode
    simple and not attempt to encode words (even for Chinese, we encode
    "ideographic" characters, not words, which are often made of two characters
    each representing a single syllable).

    If you want to encode words, you are creating an encoding based on a
    pictographic representation of human languages, and going in the opposite
    direction from the long evolution followed by the inventors of script
    systems. You would be returning to the first ages of humanity... when
    people had great difficulty understanding each other and transmitting
    their acquired knowledge.

    This does not exclude using another UTF to implement algorithms, as an
    intermediate form which eases processing. However, you are not required to
    create an actual instance of that other UTF to work with it: there are many
    examples where you can work perfectly well with a compact representation
    that fits nicely in memory with excellent performance, and where the
    decompressed form is only used locally.

    In *many* cases, notably when the text data to manage is large, adding an
    object representation whose API only exposes a temporary decompressed form
    will improve the global performance of the system, because it reduces the
    internal processing resources needed. Code that decompresses SCSU to UTF-32
    can fit in less than 1 KB of memory, yet it lets you save as many megabytes
    of memory as you wish for a large database, given that SCSU takes an
    average of roughly one byte per character (or code point) instead of 4 with
    UTF-32.
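
    A minimal sketch of that kind of wrapper, assuming ICU4J's SCSU support
    (com.ibm.icu.text.UnicodeCompressor and UnicodeDecompressor); the
    CompressedText class itself is hypothetical:

        import com.ibm.icu.text.UnicodeCompressor;
        import com.ibm.icu.text.UnicodeDecompressor;

        // Hypothetical wrapper: the text is kept in its compact SCSU form, and a
        // regular String is only materialized on demand, used locally, then discarded.
        public final class CompressedText {
            private final byte[] scsu;   // compact form, often close to 1 byte/character

            public CompressedText(String text) {
                this.scsu = UnicodeCompressor.compress(text);   // ICU4J SCSU encoder
            }

            public int compressedSize() {
                return scsu.length;
            }

            // Decompress only when a caller actually needs the characters.
            public String asString() {
                return UnicodeDecompressor.decompress(scsu);    // ICU4J SCSU decoder
            }
        }

    The decompressed String exists only while a caller holds it; the long-lived
    storage stays in the compact form.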

    Such examples exist in real-world applications, notably in spelling and
    grammar checkers, whose performance depends completely on the total size of
    the information they keep in their database and on how far that information
    is compressed (to minimize the impact on system resources, which is mostly
    determined by how much information you can fit into fast memory without
    "swapping" between fast memory and slow disk storage). The most efficient
    checkers use very compact forms, with very specific compression and
    indexing schemes, behind a transparent class that manages the conversion
    between the compact form and the usual representation of text as a linear
    stream of characters.
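
    A minimal sketch of such a transparent class (hypothetical names; real
    checkers use far more specialized compression and indexing, and ICU4J's
    SCSU classes merely stand in for whatever compact scheme is actually used):

        import java.util.*;
        import com.ibm.icu.text.UnicodeCompressor;
        import com.ibm.icu.text.UnicodeDecompressor;

        // Hypothetical word list kept compressed in memory: callers see plain
        // words, storage sees compact bytes, and at most one block is
        // decompressed per lookup.
        public final class CompactWordList {
            private final String[] firstWordOfBlock;  // small uncompressed index, sorted
            private final byte[][] blocks;            // '\n'-joined sorted words, compressed

            public CompactWordList(SortedSet<String> words, int wordsPerBlock) {
                List<String> index = new ArrayList<>();
                List<byte[]> data = new ArrayList<>();
                List<String> current = new ArrayList<>();
                for (String w : words) {              // assumes natural String ordering
                    if (current.isEmpty()) index.add(w);
                    current.add(w);
                    if (current.size() == wordsPerBlock) {
                        data.add(UnicodeCompressor.compress(String.join("\n", current)));
                        current.clear();
                    }
                }
                if (!current.isEmpty())
                    data.add(UnicodeCompressor.compress(String.join("\n", current)));
                firstWordOfBlock = index.toArray(new String[0]);
                blocks = data.toArray(new byte[0][]);
            }

            public boolean contains(String word) {
                int i = Arrays.binarySearch(firstWordOfBlock, word);
                if (i >= 0) return true;              // word is exactly a block's first entry
                int block = -i - 2;                   // last block whose first word is smaller
                if (block < 0) return false;          // sorts before every stored word
                String[] entries =
                    UnicodeDecompressor.decompress(blocks[block]).split("\n");
                return Arrays.binarySearch(entries, word) >= 0;
            }
        }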

    Other examples exist in some RDBMSs, to improve the speed of query
    processing for large databases or of full-text searches, or in their
    networking connectors, to reduce the bandwidth taken by result sets. The
    benefit of data compression becomes immediate as soon as the data to
    process must go through any kind of channel (network link, file storage,
    database table) with lower throughput than the fast but expensive or
    limited internal processing memory (including memory caches, if we consider
    data locality).
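
    As a rough sketch of the connector case (the serialization of the result
    set to one UTF-8 text blob, and the class and method names, are my own
    assumptions; only java.util.zip's deflate streams are standard):

        import java.io.ByteArrayOutputStream;
        import java.io.IOException;
        import java.net.Socket;
        import java.nio.charset.StandardCharsets;
        import java.util.zip.DeflaterOutputStream;
        import java.util.zip.InflaterInputStream;

        // Hypothetical connector helpers: only the compression of the channel
        // is illustrated here.
        public final class CompressedChannel {

            // Sender side: deflate the bytes as they go out on the socket.
            static void sendResults(Socket socket, String resultSetAsText) throws IOException {
                try (DeflaterOutputStream out =
                         new DeflaterOutputStream(socket.getOutputStream())) {
                    out.write(resultSetAsText.getBytes(StandardCharsets.UTF_8));
                }   // closing the stream finishes the deflate data
            }

            // Receiver side: inflate transparently while reading from the socket.
            static String receiveResults(Socket socket) throws IOException {
                try (InflaterInputStream in =
                         new InflaterInputStream(socket.getInputStream());
                     ByteArrayOutputStream buf = new ByteArrayOutputStream()) {
                    byte[] chunk = new byte[4096];
                    int n;
                    while ((n = in.read(chunk)) != -1) {
                        buf.write(chunk, 0, n);
                    }
                    return buf.toString("UTF-8");
                }
            }
        }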

    From: "D. Starner" <shalesller@writeme.com>
    > "Philippe Verdy" writes:
    >> Suppose that Unicode encodes the common English words "the", "an", "is",
    >> etc... then a protocol
    >> could decide that these words are not important and will filter them.
    >
    > Drop the part of the sentence before "then". A protocol could delete
    > "the", "an", etc. right
    > now. In fact, I suspect several library systems do drop "the", etc. right
    > now. Not that this
    > makes it a good idea, but that's a lousy argument.

    If such a library does this based only on the presence of the encoded
    words, without considering which language the text is written in, that kind
    of processing will be seriously inefficient or inaccurate for languages
    other than English, the language for which such a word list would have been
    built.

    For plain text (which is what Unicode deals with), even the words "an",
    "the", "is" (and so on...) are just as important as the other parts of the
    text. Encoding frequent words with a single compact code may be effective
    for a limited set of applications, but it will not be as effective as a
    more general compression scheme (deflate, bzip2, and so on...), which works
    independently of the language and without needing (when implementing
    text-processing functions) an arbitrarily large dictionary to convert those
    compact codes back to the associated plain-text words encoded as streams of
    Unicode-supported characters.
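
    A small illustration of that point (the sample sentences are made up, and
    the printed sizes will vary): a general-purpose deflate compressor from
    java.util.zip finds the repeated small words by itself, in English as well
    as in any other language, with no word table at all.

        import java.nio.charset.StandardCharsets;
        import java.util.zip.Deflater;

        public final class DeflateDemo {
            // Compress one UTF-8 text blob and return the compressed size.
            static int deflatedSize(String text) {
                byte[] input = text.getBytes(StandardCharsets.UTF_8);
                Deflater d = new Deflater(Deflater.BEST_COMPRESSION);
                d.setInput(input);
                d.finish();
                byte[] out = new byte[input.length + 64];  // large enough even if incompressible
                int size = d.deflate(out);
                d.end();
                return size;
            }

            public static void main(String[] args) {
                // String.repeat (Java 11) just builds a larger sample; tiny inputs
                // mostly pay the format's fixed overhead.
                String english =
                    "the cat is on the mat and the dog is in the garden ".repeat(200);
                String french =
                    "le chat est sur le tapis et le chien est dans le jardin ".repeat(200);
                System.out.println(english.length() + " -> " + deflatedSize(english));
                System.out.println(french.length() + " -> " + deflatedSize(french));
            }
        }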


