Re: Conflicting principles

From: Kenneth Whistler (kenw@sybase.com)
Date: Fri Aug 08 2003 - 14:37:07 EDT

    Philippe Verdy asked:

    > > Ken's point of course is that however bizarre the backing store for
    > > Sindarin and English Tengwar modes may be, combining characters per
    > > se must follow their base characters no matter what.
    >
    > Even if that breaks the logical analysis of text?

    Yes. And that is the challenge for encoding such a script, to ensure
    that analysis of certain characters as combining does not result in
    undue complexity in the logical analysis of the textual representation,
    once the repertoire of encoded characters is decided.

    > How does the Sindarin mode affect the line or word breaking rule for example:
    > suppose that the combining character is coded after the next logical
    > base character, would it be valid to break at this base character
    > and thus send the combining vowel to the next line,

    No. Not if you are following the recommendations of UAX #14:

    "Combining character sequences are treated as units for the purposes of
    line breaking."
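
    As a rough illustration of that sentence (not the full UAX #14
    algorithm), a sketch in Python could group each base character with
    the marks that follow it and allow breaks only between groups. The
    names `combining_sequences` and `wrap` are hypothetical, and the
    fixed-width chunking is an assumption made purely for illustration.

```python
import unicodedata

def combining_sequences(text):
    """Group each character with the combining marks that follow it."""
    units = []
    for ch in text:
        # General categories Mn, Mc and Me identify combining marks.
        if units and unicodedata.category(ch).startswith("M"):
            units[-1] += ch      # the mark stays attached to the preceding unit
        else:
            units.append(ch)     # any other character starts a new unit
    return units

def wrap(text, width):
    """Break text into lines of at most `width` units, never splitting a unit."""
    units = combining_sequences(text)
    return ["".join(units[i:i + width]) for i in range(0, len(units), width)]

# "e" + U+0301 COMBINING ACUTE ACCENT is kept together on one line.
lines = wrap("ae\u0301b", 2)
```

    Because the break positions are computed over units rather than
    code points, a mark is never sent to the next line away from the
    character it follows.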

    > where in fact what is intended is to use a vowel carrier for
    > the combining character logically attached to the previous base character?
    >
    > I don't know Tengwar's Sindarin mode enough to see how word
    > breaking can affect the interpretation of text. But preserving
    > the logical ordering of letters seems much more important for
    > actual text encoding than just being constrained by combining
    > rules that were created taking into account only the first
    > encoded scripts for Latin, Greek, Cyrillic, Hebrew, Arabic
    > and Hiragana/Katakana scripts that use combining characters.

    Nonsense. The order of base character and combining character is
    fundamental to the standard. If this creates a problem for a
    particular approach to encoding Tengwar, then the solution is
    to go back to the drawing board and rethink the encoding proposal
    for Tengwar -- not to call the order of combining marks in the
    standard into question.
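
    The ordering requirement can be observed with any normalization
    implementation. A minimal sketch, assuming Python's standard
    `unicodedata` module (the helper `nfc` and the variable names are
    hypothetical): a mark composes with the base it follows, and does
    nothing to a base that comes after it.

```python
import unicodedata

def nfc(s):
    """Return the NFC normalization of s."""
    return unicodedata.normalize("NFC", s)

base_then_mark = "e\u0301"   # LATIN SMALL LETTER E, then COMBINING ACUTE ACCENT
mark_then_base = "\u0301e"   # the same two characters in the opposite order

composed = nfc(base_then_mark)    # the mark applies to the preceding "e"
unchanged = nfc(mark_then_base)   # the mark has no preceding base; no composition

print(len(composed), len(unchanged))
```

    The first string normalizes to the single character U+00E9, while
    the second is left as two code points.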

    > There will certainly not be a huge revolution in writing systems
    > (families of scripts with similar behaviors), but existing systems
    > will still continue to be adapted to fit local cultural demands
    > for minorities and specialized areas,

    True.

    > that a too strict encoding model proposed now by Unicode cannot fit well.

    But this is undemonstrated, and a non sequitur besides.

    > Some examples include text that use a non linear layout, where
    > the layout carries important semantics (examples are numerous
    > for hieroglyphic languages, one of which having modern use and
    > not fitting well with Unicode which often fails to represent
    > clusters with simple combining sequences assuming a base
    > character and diacritics).

    You assume that all aspects of Egyptian hieroglyphics can and *should*
    be represented directly in plain text. Try changing your assumption.

    > To allow users to create their own clusters, Unicode has added
    > ideographic description characters which are controls used as
                                         ^^^^^^^^^^^^^^^^^^
    > prefixes for a combining sequence containing base "letters".

    False. Please reread the standard.
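
    The character properties can be checked directly. A small sketch,
    assuming Python's `unicodedata` module and taking U+2FF0 as a
    representative ideographic description character (the variable
    names are hypothetical):

```python
import unicodedata

idc = "\u2ff0"  # IDEOGRAPHIC DESCRIPTION CHARACTER LEFT TO RIGHT

idc_category = unicodedata.category(idc)    # general category of the character
idc_combining = unicodedata.combining(idc)  # canonical combining class

# Control characters have category "Cc"; combining marks have categories
# beginning with "M". This character is neither.
print(unicodedata.name(idc), idc_category, idc_combining)
```

    The general category reported is So (Symbol, other) with combining
    class 0, so the character is a graphic symbol rather than a control
    or a combining mark.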

    > This is already a break in the axiomatic view of combining
    > sequences made with a single base letter.

    False.

    >
    > Other areas where combining sequences are not following this
    > model is of course the Hangul script,

    False. Conjoining jamos are not formally combining characters in the
    standard (although in principle they *could* have been analyzed
    in those terms). Sequences of conjoining jamos are not formally
    combining character sequences in the standard, which is why separate
    mention of them has to be made in defining canonical equivalence.
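
    The distinction shows up in the character properties. A minimal
    sketch, assuming Python's `unicodedata` module (variable names are
    hypothetical): the conjoining vowel U+1161 is a letter with
    combining class 0, yet a leading consonant plus vowel sequence is
    still canonically equivalent to a precomposed syllable.

```python
import unicodedata

choseong_kiyeok = "\u1100"   # HANGUL CHOSEONG KIYEOK (leading consonant)
jungseong_a = "\u1161"       # HANGUL JUNGSEONG A (vowel)

jamo_category = unicodedata.category(jungseong_a)    # a letter, not a mark
jamo_combining = unicodedata.combining(jungseong_a)  # canonical combining class

# Canonical equivalence for Hangul is handled by separate rules in the
# normalization algorithm, not by the combining-mark machinery.
syllable = unicodedata.normalize("NFC", choseong_kiyeok + jungseong_a)
decomposed = unicodedata.normalize("NFD", "\uac00")

print(jamo_category, jamo_combining, syllable == "\uac00")
```

    Here the vowel jamo reports category Lo and combining class 0, while
    normalization still maps the two-jamo sequence to U+AC00 and back.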

    > Really there already exists many exceptions to the axiomatic
    > view of combining sequences,

    Perhaps there are many exceptions to *your* axiomatic view of
    combining sequences. But there are *no* exceptions in the standard
    itself to its definition of combining character sequences.

    > and I don't see why there could not exist a model allowing
    > new classes of combining characters attached to a *following*
    > base character,

    There *are* alternative ways to specify character combination in
    groups, if that is what you are trying to get at here. In that
    sense, yes, the rules for conjoining jamos are distinct from the
    behavior of combining character sequences in the standard.

    But until you learn to be more precise in your application of
    the terminology, advocating an alternative model for how to
    encode a script such as Tengwar is only going to keep getting
    you into trouble with the experts on this list and tend to
    confuse people who are trying to learn about the use of
    the standard.

    > such as for Tengwar Sindarin vowels (if we
    > suppose that Sindarin vowels are encoded separately from Quenya
    > vowels, because of their distinct combining properties, and
    > because the Tengwar "script" is really a family of related
    > scripts, which contains much more differences than between
    > Latin, Greek and Cyrillic separate scripts).

    Undemonstrated.

    >
    > So one cannot be satisfied by the currently limited model
    > with a single base letter and combining modifiers,

    One cannot be satisfied with your currently limited understanding
    of the distinctions in the standard and imprecise use of
    its terminology.

    --Ken

    > which would create an artificial hierarchy between letters,
    > that does not fit the cultural semantics of the encoded language.
    >
    > --
    > Philippe.


