From: Philippe Verdy (email@example.com)
Date: Sat Dec 11 2004 - 12:06:18 CST
From: "Marcin 'Qrczak' Kowalczyk" <firstname.lastname@example.org>
> Regarding A, I see three choices:
> 1. A string is a sequence of code points.
> 2. A string is a sequence of combining character sequences.
> 3. A string is a sequence of code points, but it's encouraged
> to process it in groups of combining character sequences.
> I'm afraid that anything other than a mixture of 1 and 3 is too
> complicated to be widely used. Almost everybody is representing
> strings either as code points, or as even lower-level units like
> UTF-16 units. And while 2 is nice from the user's point of view,
> it's a nightmare from the programmer's point of view:
Consider that the normalization forms are trying to approach choice number
2: they create more predictable combining character sequences which can still
be processed by algorithms as plain streams of code points.
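For example (a small sketch in Python, using only the standard unicodedata
module), normalization lets two canonically equivalent strings be compared as
plain streams of code points:

```python
import unicodedata

s1 = "e\u0301"   # 'e' followed by U+0301 COMBINING ACUTE ACCENT
s2 = "\u00e9"    # precomposed U+00E9 LATIN SMALL LETTER E WITH ACUTE

# The raw code point sequences differ...
assert s1 != s2
# ...but either normal form makes them comparable again:
assert unicodedata.normalize("NFC", s1) == unicodedata.normalize("NFC", s2)
assert unicodedata.normalize("NFD", s2) == s1
```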
Remember that the total number of possible code points is finite, but the
total number of possible combining sequences is not, meaning that text
handling will necessarily have to make decisions based on a limited set of
code points and their properties.
Note however that for most Unicode strings, the "composite" character
properties are those of the base character of the sequence. Note also that
for some languages/scripts, the linguistically correct unit of work is the
grapheme cluster; Unicode just defines "default grapheme clusters", which
can span several combining sequences. See for example the Hangul script,
written with clusters made of multiple combining sequences, where the base
character is a Unicode jamo, itself sometimes made of multiple simpler jamos
that Unicode does not allow to be decomposed into canonically equivalent
strings, even though this decomposition is inherent in the structure of the
script itself and not bound to any particular language.
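The Hangul case can be checked directly with Python's unicodedata module: a
precomposed syllable decomposes canonically into conjoining jamos, but a
"cluster" jamo that is structurally built from simpler jamos has no canonical
decomposition of its own:

```python
import unicodedata

# U+AC01 HANGUL SYLLABLE GAG decomposes canonically into three jamos:
# U+1100 (choseong G) + U+1161 (jungseong A) + U+11A8 (jongseong G).
assert unicodedata.normalize("NFD", "\uac01") == "\u1100\u1161\u11a8"

# U+1113 HANGUL CHOSEONG NIEUN-KIYEOK is structurally N + G, but
# Unicode gives it no canonical decomposition: NFD leaves it intact.
assert unicodedata.normalize("NFD", "\u1113") == "\u1113"
```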
It's hard to create a general model that will work for all the scripts
encoded in Unicode; there are too many differences. So Unicode just
standardizes a higher level of processing, with combining sequences and
normalization forms that better approximate the linguistics and semantics
of the scripts. Consider this level an intermediate tool that helps
simplify the identification of processing units.
The reality is that a written language is actually more complex than what
can be captured by any single definition of processing units. For many
similar reasons, the ideal working model is one of "simple", enumerable
abstract characters assigned a finite number of code points, out of which
the actual, non-enumerable characters can be composed.
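As a rough sketch of that model, a stream of code points can be grouped into
combining sequences by starting a new group at each character whose canonical
combining class is zero. This is a simplification: the real definition of a
combining character sequence is based on general categories (Mn/Mc/Me and
related), which this proxy does not fully capture.

```python
import unicodedata

def combining_sequences(s: str) -> list[str]:
    """Group code points into approximate combining sequences."""
    groups: list[str] = []
    for ch in s:
        # A nonzero canonical combining class marks a character that
        # attaches to a preceding base (approximation only); append it
        # to the current group instead of opening a new one.
        if groups and unicodedata.combining(ch) != 0:
            groups[-1] += ch
        else:
            groups.append(ch)
    return groups

# "e" + combining acute + "x" yields two groups: the accented e, then x.
assert combining_sequences("e\u0301x") == ["e\u0301", "x"]
```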
But the situation is not ideal for some scripts, notably the ideographic
ones, whose very complex and often "inconsistent" composition rules and
layout require allocating many code points, one for each combination.
Working with ideographic scripts requires many more character properties
than other scripts (see for example the numerous and varied properties
defined in the Unihan database, which are still not standardized due to the
difficulty of representing them and the slow discovery of errors, omissions,
or contradictions in the various sources of this data...)
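This one-code-point-per-combination situation is visible in the character
data itself: unlike Hangul syllables, CJK unified ideographs carry no
canonical decomposition, even when they are visually composed of simpler
components.

```python
import unicodedata

# U+AC00 HANGUL SYLLABLE GA decomposes into two conjoining jamos...
assert len(unicodedata.normalize("NFD", "\uac00")) == 2

# ...but a CJK unified ideograph such as U+4E2D is atomic under NFD:
assert unicodedata.normalize("NFD", "\u4e2d") == "\u4e2d"
assert unicodedata.name("\u4e2d") == "CJK UNIFIED IDEOGRAPH-4E2D"
```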
This archive was generated by hypermail 2.1.5 : Sat Dec 11 2004 - 12:10:45 CST