From: Richard Cook (rscook@socrates.berkeley.edu)
Date: Sun Dec 05 2004 - 03:44:40 CST
On Dec 5, 2004, at 12:27 AM, Tim Finney wrote:
> my co-worker suggested encoding entire words in Unicode.
The "word" is considerably less well-defined than the character. The
set of words is open-ended. If you'd like to see where you go when you
start trying to encode words, take a look at CJK Extension B. CJK
ideographs are much like words, in that both are composed of more
basic units. English words are composed of letters, while ideographs
are composed of strokes. If you encode only higher-level constructs,
then you must address the issue of input/indexing via lower-level
units. So, there's no way to escape from defining the lower-level
units. If you mean to suggest encoding words as shorthand for sequences
of encoded low-level units, that might work for very specific,
well-defined purposes. But whenever someone creates a neologism (and
word creation is an ongoing process in all living languages), you need
to revisit the encoding process, and encode a new unit. This is
burdensome, to say the least. I think that most people who work on
encoding like to imagine that it is mostly a finite task. Maintenance
of the standard is infinite, but encoding should taper off,
comparatively, over time, except for the encoding of CJK ideographs.
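
The point about neologisms can be illustrated with a minimal Python sketch. It assumes a hypothetical word-level codebook (`WORD_CODES`) whose values are picked from the Supplementary Private Use Area-A purely for illustration; they are not real Unicode assignments, and the function names are invented for this example.

```python
# Hypothetical word-level codebook: one code per whole word.
# Values are private-use code points chosen for illustration only.
WORD_CODES = {"the": 0xF0000, "word": 0xF0001, "is": 0xF0002, "open": 0xF0003}


def encode_by_character(text):
    """Encode any string as a sequence of existing character code points."""
    return [ord(ch) for ch in text]


def encode_by_word(text):
    """Encode each whitespace-separated word via the fixed codebook.

    Raises KeyError for any word the codebook does not yet contain.
    """
    return [WORD_CODES[w] for w in text.split()]


known = "the word is open"
neologism = "the word is googleable"

# Character-level encoding handles both strings with no change to the repertoire.
chars_known = encode_by_character(known)
chars_new = encode_by_character(neologism)

# Word-level encoding works only until a word outside the codebook appears.
words_known = encode_by_word(known)
try:
    encode_by_word(neologism)
    word_level_failed = False
except KeyError:
    # The codebook would have to be revised before this text could be encoded.
    word_level_failed = True
```

In this sketch the character-level encoder never needs a new entry, while the word-level encoder fails on the first unseen word, which is the maintenance burden described above.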