Re: Unicode for words?

From: Richard Cook (rscook@socrates.berkeley.edu)
Date: Sun Dec 05 2004 - 03:44:40 CST


    On Dec 5, 2004, at 12:27 AM, Tim Finney wrote:

    > my co-worker suggested encoding entire words in Unicode.

    The "word" is considerably less well-defined than the character. The
    set of words is open-ended. If you'd like to see where you go when you
    start trying to encode words, take a look at CJK Extension B. CJK
    ideographs are much like words, in that both are composed of more
    basic units. English words are composed of letters, while ideographs
    are composed of strokes. If you encode only higher-level constructs,
    then you must address the issue of input/indexing via lower-level
    units. So, there's no way to escape from defining the lower-level
    units. If you mean to suggest encoding words as shorthand for sequences
    of encoded low-level units, that might work for very specific,
    well-defined purposes. But whenever someone creates a neologism (and
    word creation is an ongoing process in all living languages), you need
    to revisit the encoding process, and encode a new unit. This is
    burdensome, to say the least. I think that most people who work on
    encoding like to imagine that it is mostly a finite task. Maintenance
    of the standard is infinite, but encoding should taper off,
    comparatively, over time, except for the encoding of CJK ideographs.
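
    To make the "shorthand" point concrete, here is a minimal sketch in
    Python. Everything in it is an assumption for illustration: the word
    list, the names WORDS, encode and decode, and the choice of the
    Private Use Area (U+E000 onward) as the home for the hypothetical
    word codes. Nothing like this is part of Unicode.

```python
# Hypothetical sketch: a closed table of "word codes" layered on top of
# ordinary characters. The word list and the use of the Private Use Area
# are illustrative assumptions, not anything defined by the standard.

WORD_BASE = 0xE000  # first code point of the Private Use Area (U+E000..U+F8FF)

# The closed inventory of "encoded words". Adding a word means revising this.
WORDS = ["unicode", "encoding", "character"]

WORD_TO_CODE = {word: chr(WORD_BASE + i) for i, word in enumerate(WORDS)}
CODE_TO_WORD = {code: word for word, code in WORD_TO_CODE.items()}


def encode(text):
    """Replace each known word with its single code point.

    Words missing from the table (neologisms, for instance) cannot be
    replaced, so they stay spelled out in lower-level units (letters).
    """
    return " ".join(WORD_TO_CODE.get(tok, tok) for tok in text.split(" "))


def decode(text):
    """Expand word codes back into their letter sequences."""
    return " ".join(CODE_TO_WORD.get(tok, tok) for tok in text.split(" "))


if __name__ == "__main__":
    known = encode("unicode encoding")
    mixed = encode("unicode blog")  # "blog" is not in the table
    print([hex(ord(ch)) for ch in known if ch != " "])
    print(decode(mixed))
```

    The word that is not in the table still has to be written with
    letters, so the lower-level units never go away, and each new word
    means another revision of the table.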


