Re: Nicest UTF

From: Philippe Verdy (verdy_p@wanadoo.fr)
Date: Sat Dec 11 2004 - 12:06:18 CST

    From: "Marcin 'Qrczak' Kowalczyk" <qrczak@knm.org.pl>
    > Regarding A, I see three choices:
    > 1. A string is a sequence of code points.
    > 2. A string is a sequence of combining character sequences.
    > 3. A string is a sequence of code points, but it's encouraged
    > to process it in groups of combining character sequences.
    >
    > I'm afraid that anything other than a mixture of 1 and 3 is too
    > complicated to be widely used. Almost everybody is representing
    > strings either as code points, or as even lower-level units like
    > UTF-16 units. And while 2 is nice from the user's point of view,
    > it's a nightmare from the programmer's point of view:

    Consider that the normalized forms are an attempt to approach choice 2:
    they create more predictable combining character sequences which can still
    be processed by algorithms as plain streams of code points.
    Remember that the total number of possible code points is finite, but the
    total number of possible combining sequences is not, meaning that text
    handling will necessarily have to make decisions based on a limited set of
    properties.
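
    To illustrate the point (a minimal sketch in Python, using only the
    standard unicodedata module; splitting on canonical combining class 0 is
    the simplification under discussion here, not the full Unicode definition
    of a combining character sequence):

        import unicodedata

        def combining_sequences(s):
            # Normalize first so that sequences become more predictable:
            # NFC composes "e" + U+0301 into the single code point U+00E9.
            s = unicodedata.normalize("NFC", s)
            groups = []
            for ch in s:
                # Canonical combining class 0 marks a "starter" code point.
                # (Simplification: some combining marks also have class 0.)
                if unicodedata.combining(ch) == 0 or not groups:
                    groups.append(ch)
                else:
                    groups[-1] += ch
            return groups

        # "e" + COMBINING ACUTE composes to one code point; "q" + COMBINING
        # ACUTE has no precomposed form and remains a 2-code-point sequence.
        print(combining_sequences("e\u0301 q\u0301"))  # ['é', ' ', 'q́']

    The stream-of-code-points algorithm stays simple, but the number of
    distinct groups it can produce is unbounded, which is exactly why
    properties must be derived from the finite set of code points.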

    Note however that for most Unicode strings, the "composite" character
    properties are those of the base character in the sequence. Note also that
    for some languages/scripts, the linguistically correct unit of work is the
    grapheme cluster; Unicode only defines "default grapheme clusters", which
    can span several combining sequences (see for example the Hangul script,
    written with clusters made of multiple combining sequences, where the base
    character is a Unicode jamo, itself sometimes made of multiple simpler
    jamos that Unicode does not allow to be decomposed into canonically
    equivalent strings, even though this decomposition is inherent in the
    structure of the script itself, and not bound to any particular language,
    which Unicode will not standardize).
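
    A concrete example of this (a small Python sketch, again using only the
    standard unicodedata module) is a precomposed Hangul syllable: its
    canonical decomposition yields conjoining jamos that all carry combining
    class 0, so a naive combining-sequence splitter sees three units where the
    default grapheme cluster rules see one user-perceived character:

        import unicodedata

        syllable = "\uAC01"  # HANGUL SYLLABLE GAG
        for ch in unicodedata.normalize("NFD", syllable):
            print(f"U+{ord(ch):04X} ccc={unicodedata.combining(ch)} "
                  f"{unicodedata.name(ch)}")
        # U+1100 ccc=0 HANGUL CHOSEONG KIYEOK
        # U+1161 ccc=0 HANGUL JUNGSEONG A
        # U+11A8 ccc=0 HANGUL JONGSEONG KIYEOK

        # And a "complex" jamo such as U+1101 HANGUL CHOSEONG SSANGKIYEOK,
        # structurally a doubled U+1100, has no canonical decomposition:
        print(unicodedata.decomposition("\u1101"))  # '' (empty)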

    It's hard to create a general model that will work for all scripts encoded
    in Unicode; there are too many differences. So Unicode standardizes a
    higher level of processing, with combining sequences and normalization
    forms that better approximate the linguistics and semantics of the
    scripts. Consider this level an intermediate tool that helps simplify the
    identification of processing units.

    The reality is that a written language is more complex than anything that
    can be captured by a single definition of processing units. For this and
    many similar reasons, the ideal working model is one built from "simple",
    enumerable abstract characters with a finite number of code points, out of
    which the actual, non-enumerable characters can be composed.

    But the situation is not ideal for some scripts, notably ideographic ones,
    whose very complex and often "inconsistent" composition rules and layout
    require allocating many code points, one for each combination. Working
    with ideographic scripts requires many more character properties than
    other scripts (see for example the huge and varied set of properties
    defined in UniHan, which are still not standardized due to the difficulty
    of representing them and to the slow discovery of errors, omissions, and
    contradictions in the various sources for this data...)
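
    For what it's worth, the UniHan data is at least mechanically easy to
    consume: it is distributed as plain tab-separated lines of the form
    "U+4E00<TAB>property<TAB>value". A minimal Python sketch (the file path
    and the choice of property below are placeholders, not a fixed API):

        def load_unihan_field(path, field):
            # Collect one property (e.g. "kTotalStrokes") from a Unihan
            # data file into a {character: value} mapping.
            table = {}
            with open(path, encoding="utf-8") as f:
                for line in f:
                    if line.startswith("#") or not line.strip():
                        continue  # skip comments and blank lines
                    cp, prop, value = line.rstrip("\n").split("\t", 2)
                    if prop == field:
                        table[chr(int(cp[2:], 16))] = value
            return table

        # Hypothetical usage against a locally downloaded Unihan.txt:
        # strokes = load_unihan_field("Unihan.txt", "kTotalStrokes")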


