Re: Unicode for words?

From: Philippe Verdy (verdy_p@wanadoo.fr)
Date: Sun Dec 05 2004 - 10:21:57 CST

    From: "Ray Mullan" <ray@mullan.net>
    >I don't see how the one million available codepoints in the Unicode
    >Standard could possibly accommodate a grammatically accurate vocabulary of
    >all the world's languages.

    You have misread the message from Tim: he wanted to use "code points" above
    U+10FFFF within the full 32-bit space (meaning more than 4 billion code
    points, when Unicode and ISO/IEC 10646 only allow about 1.1 million...)
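
    As a quick sanity check of those figures (a trivial Python snippet, added
    here purely for illustration):

        # Size of a full 32-bit code space vs. the Unicode code space.
        print(2 ** 32)          # 4294967296 possible 32-bit values
        print(0x10FFFF + 1)     # 1114112 code points allowed by Unicode/ISO 10646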

    He wanted to use that to encode each word as a single code point, as a
    possible compression scheme. But he forgets that a word's component letters
    can be affected by styling or by the rendering process.

    Also a "font" or renderer would be unable to draw the text without having
    the equivalent of an indexed dictionnary of all words on the planet!

    If compression is the goal, he forgets that the space gain offered by such a
    scheme would be very modest compared to more generic data compressors like
    deflate or bzip2, which can compress the represented texts more efficiently
    without even needing such a large dictionary (a dictionary that is in
    perpetual evolution, extended by every speaker of every language, without
    any prior standard agreement anywhere!).
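
    A rough back-of-the-envelope comparison (using an arbitrary repetitive
    sample text, so the exact figures mean nothing by themselves) shows why
    generic compressors already do better than a fixed-width "one code point
    per word" encoding:

        import bz2
        import zlib

        # Arbitrary sample text, repeated to give the compressors something
        # to work with; real texts will of course give different numbers.
        text = ("the quick brown fox jumps over the lazy dog " * 200).encode("utf-8")

        # Naive "word as code point" scheme: 4 bytes (UTF-32-like) per word,
        # ignoring the cost of shipping or agreeing on the dictionary itself.
        word_code_size = len(text.decode("utf-8").split()) * 4

        print("plain UTF-8      :", len(text), "bytes")
        print("4-byte word codes:", word_code_size, "bytes")
        print("deflate (zlib)   :", len(zlib.compress(text, 9)), "bytes")
        print("bzip2            :", len(bz2.compress(text, 9)), "bytes")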

    Forget his idea; it is technically impossible to do. At best you could
    create some protocols that compact some widely used words (this is what WAP
    does for widely used markup elements and attributes), but that is still not
    a standard outside of this limited context.
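
    To make that concrete, here is a rough sketch of such a token-table scheme;
    the token values below are made up for illustration and are not the actual
    WAP/WBXML assignments:

        # A closed token table maps a few frequent strings to single reserved
        # byte values; anything else is passed through as a length-prefixed
        # literal. Token values are invented for this example.
        TOKENS = {b"<card>": 0x01, b"</card>": 0x02, b"<p>": 0x03}

        def compact(chunks):
            out = bytearray()
            for chunk in chunks:
                if chunk in TOKENS:
                    out.append(TOKENS[chunk])   # one byte replaces the whole token
                else:
                    out.append(0xFF)            # marker for a literal run (invented)
                    out.append(len(chunk))      # 1-byte length prefix (chunks < 256 bytes)
                    out.extend(chunk)           # raw bytes follow unchanged
            return bytes(out)

        print(compact([b"<card>", b"Hello", b"</card>"]).hex())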

    Suppose that Unicode encoded the common English words "the", "an", "is",
    etc.; then a protocol could decide that these words are not important and
    filter them out. What would happen if these "words" appeared in non-English
    languages where they are semantically significant? These words would be
    missing. To work around this problem, the code points would have to
    designate the words used in one language and not the others, so "an" would
    get different code points depending on whether it is used in English or in
    another language.

    The last problem is that too many languages do not have well-established,
    computerized lexical dictionaries, and the grammatical rules that allow
    composing words are not always known. The number of words in a single
    language also cannot be bounded by a known maximum (a good example is
    German, where compound words are virtually unlimited!)

    So forget this idea: Unicode will not create a standard to encode words.
    Words will be represented by modeling them onto a script system made of
    simpler sets of "letters", "ideographs", punctuation and diacritics. The
    representation of words with those letters is an orthographic system,
    specific to each language, which Unicode will not standardize.


