At 01:27 PM 7/11/2002 -0400, Suzanne M. Topping wrote:
>Unicode is a character set. Period.
Well, maybe. But in a much broader sense than the character sets it subsumes in its listings. Each character has numerous properties in Unicode, whereas characters in legacy character sets generally don't.
Maybe Unicode is less a character set than a shared set of rules that apply to the low-level data structures surrounding text and its algorithms.
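To make the point about per-character properties concrete, here is a minimal sketch using Python's standard `unicodedata` module. The choice of Python and the helper name `describe` are assumptions made purely for illustration; the properties shown are only a small sample of what the Unicode Character Database records.

```python
import unicodedata

def describe(ch):
    """Return a handful of Unicode properties for one character (illustrative helper)."""
    return {
        "name": unicodedata.name(ch, "<unnamed>"),
        "category": unicodedata.category(ch),            # general category, e.g. Lu, Lo
        "bidi_class": unicodedata.bidirectional(ch),     # input to the bidi algorithm
        "combining": unicodedata.combining(ch),          # canonical combining class
        "east_asian_width": unicodedata.east_asian_width(ch),
    }

# Latin capital A, Hebrew alef, Hiragana a
for ch in ("A", "\u05d0", "\u3042"):
    print(f"U+{ord(ch):04X}", describe(ch))
```

A legacy character set would typically give you only the code value for each of these three characters; everything else in the dictionary is data that Unicode attaches on top.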
>The Unicode consortium very wisely keeps its focus narrow. It provides
>a mechanism for specifying characters. Not for manipulating them, not
>for describing them, not for making them twinkle.
All true, except for some special cases (BOM, bidi issues and algorithms, vertical variants, etc.). I'm not saying those shouldn't be in there, just that they are useful only to algorithms that are either explicit (bidi) or assumed (upper case/lower case, vertical/horizontal).
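As a rough illustration of why these cases only make sense alongside an algorithm, the sketch below uses Python's built-in string methods and the standard `unicodedata` module. The helper names `case_round_trip` and `bidi_classes` are hypothetical, and this only exposes the per-character input data; it does not implement the bidirectional algorithm itself.

```python
import unicodedata

def case_round_trip(text):
    """Upper-case, then lower-case again; the result need not equal the input."""
    return text.upper().lower()

def bidi_classes(text):
    """List the bidi class of each character: the raw input a bidi algorithm consumes."""
    return [unicodedata.bidirectional(ch) for ch in text]

# Case mapping is not a simple one-to-one table lookup:
print("straße".upper())           # one character becomes two
print(case_round_trip("straße"))  # and the round trip does not restore it

# Mixed Latin, Hebrew, and digit: the classes alone say nothing about display order
print(bidi_classes("a\u05d01"))
```

The character data is the same in every case; what you get out depends entirely on which algorithm is applied to it.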
In many cases these algorithms are not well known, even amongst the cognoscenti, nor generally available in nice libraries. Anyone for an open source Japanese word-splitting library? (I know that not taking a look at ICU before I press send is going to come back to haunt me on this, but if it is in there, then substitute something that isn't :)
This archive was generated by hypermail 2.1.2 : Fri Jul 12 2002 - 13:15:13 EDT