Re: Questions on ZWNBS - for line initial holam plus alef

From: Philippe Verdy (verdy_p@wanadoo.fr)
Date: Thu Aug 07 2003 - 10:27:35 EDT

    On Thursday, August 07, 2003 2:40 AM, Doug Ewell <dewell@adelphia.net> wrote:

    > Kenneth Whistler <kenw at sybase dot com> wrote:
    >
    > > But I challenge you to find anything in the standard that
    > > *prohibits* such sequences from occurring.
    >
    > I've learned that this question of "illegal" or "invalid" character
    > sequences is one of the main distinguishing factors between those who
    > truly understand Unicode and those who are still on the Road to
    > Enlightenment.
    >
    > Very, very few sequences of Unicode characters are truly "invalid" or
    > "illegal." Unpaired surrogates are a rare exception.
    >
    > In almost all cases, a given sequence might give unexpected results
    > (e.g. putting a combining diacritic before the base character) or
    > might be ineffectual (e.g. putting a variation selector before an
    > arbitrary character), but it is still perfectly legal to encode and
    > exchange such a sequence.

    For Unicode itself this is true, but what users want is interoperability
    of the encoded text, with accurate rendering rules.
    In practice, any sequence whose behavior is undefined or unpredictable
    will not interoperate and should not be used.

    The standard should then strongly promote what is a /valid/ encoding
    for text with regard to interoperability across all text processing
    algorithms, including parsing combining sequences, collation, and
    computing character properties from those /valid/ encoded sequences.

    We don't have to care much if some encoded text considered valid
    under Unicode/ISO-IEC10646 is rendered or processed differently
    or unpredictably, provided that this does not affect common text for
    actual languages.

    In fact the standard specifies that ALL sequences made of code
    points in U+0000 to U+10FFFF (excluding U+xFFFE, U+xFFFF
    and surrogates in U+D800 to U+DFFF) are valid under ISO/IEC
    10646, but it does not attempt to assign properties or behavior to
    ALL of these characters or encoded sequences, as specifying this
    behavior is the job of Unicode.
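
    As a rough illustration of how little this code-point-level validity
    covers, here is a minimal Python sketch. The helper names are
    hypothetical, and the excluded set is assumed to be exactly the one
    named above (surrogates plus U+xFFFE and U+xFFFF in every plane).

```python
# Hypothetical helpers: a per-code-point check and nothing more.
MAX_CODE_POINT = 0x10FFFF


def is_surrogate(cp: int) -> bool:
    """True for U+D800..U+DFFF."""
    return 0xD800 <= cp <= 0xDFFF


def is_plane_final_noncharacter(cp: int) -> bool:
    """True for U+xFFFE and U+xFFFF, the last two code points of a plane."""
    return (cp & 0xFFFF) >= 0xFFFE


def is_usable_code_point(cp: int) -> bool:
    """Assumed reading of 'valid': in range and not in the excluded sets."""
    return (
        0 <= cp <= MAX_CODE_POINT
        and not is_surrogate(cp)
        and not is_plane_final_noncharacter(cp)
    )


def all_code_points_usable(text: str) -> bool:
    """Checks each code point in isolation; sequence order is never examined."""
    return all(is_usable_code_point(ord(ch)) for ch in text)
```

    A check of this kind accepts holam followed by alef just as readily
    as alef followed by holam, since it never looks at how code points
    combine.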

    If there's something to enhance in the Unicode standard (not in
    ISO/IEC 10646), it's exactly the specification of interoperable encoded
    sequences. This certainly means that concrete examples for actual
    languages must be documented. Just assigning properties to individual
    ISO/IEC 10646 characters is not enough, and Unicode should
    concentrate more effort on the actual encoding of text and not only on
    individual characters.

    So for me, the "validity" of text is an ISO/IEC 10646 concept (now
    shared with Unicode versions for the assignment of characters in the
    repertoire), related only to the legally usable code points, whereas
    Unicode speaks about "well-formed" or "ill-formed" sequences, or about
    "normalized" sequences and transformations that preserve the actual
    text semantics.

    There is no ambiguity in ISO/IEC 10646 for the character assignments.
    But composed sequences are the real problem, for which Unicode
    must seek agreements: the W3C character model is only based on
    the simplified combining sequences, but Unicode should go further
    with much more precise rules for the encoding of actual text, even
    before any attempt to describe other transformation algorithms (only
    the NF* transformations have a stability policy for now, but actual
    text writers also need stability in the text composition rules for
    actual languages).
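
    To make the NF* point concrete, here is a small Python sketch using
    the standard unicodedata module. The choice of bet with hiriq and
    holam is only an illustrative assumption, and the helper names are
    hypothetical. It shows that normalization fixes one canonical order
    for two marks on the same base, whatever order the writer typed.

```python
import unicodedata

BET, HIRIQ, HOLAM = "\u05D1", "\u05B4", "\u05B9"

# Two different code point orders for the same base letter and marks.
typed_a = BET + HOLAM + HIRIQ
typed_b = BET + HIRIQ + HOLAM


def combining_classes(text: str) -> list:
    """Canonical combining class of each code point (0 for base letters)."""
    return [unicodedata.combining(ch) for ch in text]


def canonically_equivalent(a: str, b: str) -> bool:
    """Two strings are canonically equivalent if their NFD forms match."""
    return unicodedata.normalize("NFD", a) == unicodedata.normalize("NFD", b)


print(combining_classes(typed_a), combining_classes(typed_b))
print(canonically_equivalent(typed_a, typed_b))
```

    The stability policy guarantees this reordering, but it says nothing
    about which sequences a writer should produce in the first place.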

    We certainly don't need more assigned code points for existing
    scripts. What we need is more rules for the actual representation of
    text using these scripts, and for how distinct scripts can interact
    and be mixed. There are some rules already specified for combining
    jamos, for combining marks in the Latin/Cyrillic/Greek alphabets, and
    for Hiragana/Katakana, but we are still far from an agreement for
    Hebrew, and even for some Han composed sequences, which still lack a
    specification needed for interoperability.

    The current wording of "Unicode validity" is for me very weak, and
    probably defective. What it designates is only an ISO/IEC 10646
    validity for the code points used, and the validity of their UTF*
    transformations, based on individual code points. The kind of validity
    rules users want from Unicode is conformance of the actually encoded
    scripts for actual languages, for interoperability and data exchange.

    The fact that Unicode was born from trying to maximize roundtrip
    convertibility with legacy codepages or encoded character sets has
    introduced many difficulties: first the base+combining characters
    model was introduced as fundamental for alphabetic scripts with
    separate letters for vowels. Then there's the case of Brahmic scripts,
    which complicates things, as Unicode has chosen to support both
    the ISCII standard model with nuktas and viramas in logical encoding
    order, and the TIS-620 model for Thai and Lao with a physical
    (visual) order. By contrast, the combining jamos model is remarkably
    simple, and it still follows the logical model shared by alphabetic
    scripts.
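
    The difference between the two orders can be seen by listing code
    points. This is a minimal Python sketch; the two syllables are my own
    choice of illustration and describe() is a hypothetical helper.

```python
import unicodedata


def describe(text: str) -> list:
    """List each code point with its character name, in storage order."""
    return [f"U+{ord(ch):04X} {unicodedata.name(ch)}" for ch in text]


# Thai: the vowel sign is written to the left of the consonant and is
# also stored first (physical order).
thai_ke = "\u0E40\u0E01"

# Devanagari: the vowel sign is drawn to the left of the consonant but
# is stored after it (logical order).
devanagari_ki = "\u0915\u093F"

for line in describe(thai_ke) + describe(devanagari_ki):
    print(line)
```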

    Looking now at the difficulties of encoding Tengwar reveals most of
    the difficulties that already exist for Thai, and now Hebrew, as well
    as the subtle artefacts needed in existing scripts used to
    transliterate foreign languages. Some of these difficulties now also
    affect the general alphabetic scripts (Latin notably), showing that
    the immutable model used to encode base letters and diacritics is not
    universal. So Unicode will need to extend its own character model and
    specify it in much more detail to support more scripts and languages,
    including in the case of transliterations.

    Maybe in the future this will lead to a new level of conformance,
    defining something more precise than just some basic canonical
    equivalence rules (for NF* transforms and XML), with more
    precise definitions of "ill-formed" or "defective" sequences (I confess
    that I do not understand the need to differentiate the two concepts,
    and the current separation is really more confusing than helpful for
    understanding the Unicode standard). What this means is that we need
    something saying "Unicode valid text" and not just "Unicode encoded
    text", which relates only to the shared assignment of code points to
    individual characters. The current "valid" term should be left to the
    ISO/IEC 10646 standard, and to the very few Unicode algorithms
    that handle only individual code points (such as UTF* encoding
    forms and schemes), but its current definition is not helping
    implementers and writers to produce interoperable textual data.
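
    For reference, here is how I understand the two existing terms,
    as a hedged Python sketch. Both function names are hypothetical.
    "Ill-formed" is approximated by whether Python can serialize the
    string as UTF-8, and "defective" is reduced to the single case of
    text that starts with a combining mark; neither is a complete
    implementation of the standard's definitions.

```python
import unicodedata


def is_well_formed(text: str) -> bool:
    """Approximation: the text can be serialized as UTF-8.

    Python refuses to encode lone surrogates, which is the main way a
    str can fail this test.
    """
    try:
        text.encode("utf-8")
        return True
    except UnicodeEncodeError:
        return False


def starts_defective(text: str) -> bool:
    """True if the text begins with a combining mark (category Mn, Mc, Me),
    that is, a combining sequence with no base character before it."""
    return bool(text) and unicodedata.category(text[0]).startswith("M")


holam_alef = "\u05B9\u05D0"   # line-initial holam followed by alef
lone_surrogate = "\uD800"     # cannot be serialized in any UTF

print(is_well_formed(holam_alef), starts_defective(holam_alef))
print(is_well_formed(lone_surrogate))
```

    The holam plus alef sequence passes the first test and fails only the
    second, which is a property of the text and not of its encoding form.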

    If the term "valid" cannot be changed, then I suggest defining
    "conforming" for encoded text independently of its validity (a
    "conforming text" would still need to use a "valid encoding").

    -- 
    Philippe.
    Spam not tolerated: any unsolicited message will be
    reported to your Internet service providers.
    

