Re: Conflicting principles

From: Philippe Verdy (verdy_p@wanadoo.fr)
Date: Fri Aug 08 2003 - 10:06:16 EDT

Next message: John Clews: "Which ancestral links"

Previous message: Janusz S. Bieñ: "Re: UTF-8 and HTML import into MS Word 2000"
In reply to: Michael Everson: "RE: Conflicting principles"
Next in thread: Michael Everson: "RE: Conflicting principles"

On Thursday, August 07, 2003 11:29 PM, Michael Everson <everson@evertype.com> wrote:

> Ken's point of course is that however bizarre the backing store for
> Sindarin and English Tengwar modes may be, combining characters per
> se must follow their base characters no matter what.

Even if that breaks the logical analysis of text?
How does the Sindarin mode affect line or word breaking, for example?
Suppose that the combining character is coded after the next logical base character. Would it be valid to break before that base character and thus send the combining vowel to the next line, when what is actually intended is a vowel carrier for a combining character that is logically attached to the previous base character?
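
To make the question concrete, here is a minimal sketch in Python. It is not the full Unicode segmentation algorithm, only the basic rule that a combining mark joins the base character coded before it. Tengwar is not encoded, so plain Latin letters and U+0301 stand in for it, and the function name `clusters` is purely illustrative.

```python
import unicodedata

def clusters(text):
    """Group each base character with the combining marks coded after it."""
    out = []
    for ch in text:
        # General categories Mn, Mc and Me are the combining marks.
        if out and unicodedata.category(ch).startswith("M"):
            out[-1] += ch      # the mark joins the preceding base
        else:
            out.append(ch)     # a new base starts a new cluster
    return out

# Stand-in for a vowel that logically belongs to "x" but is coded after "y":
units = clusters("xy\u0301")
# The mark ends up in the cluster of "y", so a break between "x" and
# "y" + mark is allowed, and the vowel travels with "y" to the next line.
```

Under this simplified rule nothing in the data says that the mark belongs with the earlier letter, which is exactly the concern raised above.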

I don't know Tengwar's Sindarin mode well enough to see how word breaking can affect the interpretation of text. But preserving the logical ordering of letters seems much more important for actual text encoding than being constrained by combining rules that were created with only the first encoded scripts in mind: the Latin, Greek, Cyrillic, Hebrew, Arabic and Hiragana/Katakana scripts that use combining characters.

The response to such an answer would come from other scripts that are still unencoded. You cited some of them that have similar difficulties, are not extinct, and have a huge amount of existing text to represent. They include many modern languages that are only partly literate and that would benefit from a written form modeled on similar languages spoken and written in the same cultural region, notably in Africa, Central Asia, and Oceania (regions that have suffered for too long from the absence of a writing system for minority languages that is easy to adapt and learn).

Even in India, there is still no consensus on the use of the ISCII-based writing system for Brahmic scripts, and the current work on Tibetan and on Indo-Aryan languages shows that the officially adopted system does not meet the cultural demands of minority users, because it does not fit their languages very well.

There will certainly not be a huge revolution in writing systems (families of scripts with similar behaviors), but existing systems will continue to be adapted to local cultural demands for minorities and specialized areas, which the too-strict encoding model now proposed by Unicode cannot fit well. Some examples include texts that use a non-linear layout, where the layout carries important semantics (examples are numerous in hieroglyphic scripts, one of which is in modern use and does not fit well with Unicode, which often fails to represent clusters as simple combining sequences that assume a base character and diacritics).

If one looks at Korean jamos, the problem has only been solved by actually *reducing* the number of layout combinations and by creating artificial "letters" (jamos) for some combinations that are logically perceived as multiple letters (for example the SSANGKIYEOK jamo, which is really a pair of KIYEOK letters). Syllables are thus only partly decomposed into their component letters, and the composition layout is greatly simplified but does not correctly match the historic Hangul clusters.
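
The jamo point can be checked with Python's standard `unicodedata` module. This is only a small sketch: the sample syllable U+AE4C is my own choice, and the standard's character name spells the jamo SSANGKIYEOK (U+1101).

```python
import unicodedata

# The doubled initial consonant is a single encoded jamo.
ssang_name = unicodedata.name("\u1101")

# It has no decomposition into two U+1100 KIYEOK jamos.
ssang_decomp = unicodedata.decomposition("\u1101")

# Canonical decomposition of a syllable stops at that jamo.
nfd = unicodedata.normalize("NFD", "\uAE4C")

print(ssang_name)                         # HANGUL CHOSEONG SSANGKIYEOK
print(repr(ssang_decomp))                 # ''
print([f"U+{ord(c):04X}" for c in nfd])   # ['U+1101', 'U+1161']
```

So the syllable decomposes into two jamos, not three letters: the doubled consonant stays one unit.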

Probably the same can be said about Han ideographs, which are constantly updated with new clusters, and even about Hiragana/Katakana clusters that are currently represented as single code points when they are in fact composed, and that are constantly enriched with new clusters, notably in scientific fields. To allow users to create their own clusters, Unicode has added ideographic description characters, which are controls used as prefixes for a combining sequence containing base "letters". This is already a break in the axiomatic view of combining sequences made with a single base letter.
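
For reference, the properties of such a sequence can be inspected with Python's `unicodedata` module. This is only a sketch: the sample sequence (U+2FF0 followed by U+5973 and U+5B50, describing the ideograph U+597D) is my own choice. Note that the character database classifies the description character as a symbol (general category So), and none of the three characters has a non-zero combining class.

```python
import unicodedata

# Left-to-right description operator followed by two base ideographs.
ids = "\u2FF0\u5973\u5B50"

operator_name = unicodedata.name(ids[0])
categories = [unicodedata.category(ch) for ch in ids]
combining_classes = [unicodedata.combining(ch) for ch in ids]

for ch, cat, ccc in zip(ids, categories, combining_classes):
    print(f"U+{ord(ch):04X} category={cat} ccc={ccc}")
```

The operator comes first and takes the following base characters as its arguments, which is a prefix structure quite unlike "base letter plus trailing marks".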

Other areas where combining sequences do not follow this model are, of course, the Hangul script, the CGJ character used between two base letters, double-width diacritics, and so on. There already exist many exceptions to the axiomatic view of combining sequences, and I don't see why there could not be a model allowing new classes of combining characters attached to a *following* base character, such as for Tengwar Sindarin vowels (supposing that Sindarin vowels are encoded separately from Quenya vowels because of their distinct combining properties, and because the Tengwar "script" is really a family of related scripts, with many more differences among them than there are between the separate Latin, Greek and Cyrillic scripts).
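
The characters mentioned here can be compared with an ordinary accent using Python's `unicodedata` module. A small sketch follows; the choice of U+0361 as the double diacritic and U+0301 as the ordinary accent is mine, and the dictionary name `props` is illustrative only.

```python
import unicodedata

samples = {
    "CGJ": "\u034F",               # COMBINING GRAPHEME JOINER
    "double diacritic": "\u0361",  # COMBINING DOUBLE INVERTED BREVE
    "ordinary accent": "\u0301",   # COMBINING ACUTE ACCENT
}

# Map each label to (general category, canonical combining class).
props = {
    label: (unicodedata.category(ch), unicodedata.combining(ch))
    for label, ch in samples.items()
}

for label, (cat, ccc) in props.items():
    print(f"{label}: category={cat} ccc={ccc}")
```

All three are nonspacing marks coded after a base character, yet the double diacritic has its own combining class (234, versus 230 for the acute accent) because it spans two base letters, and CGJ has class 0.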

So one cannot be satisfied with the currently limited model of a single base letter plus combining modifiers, which creates an artificial hierarchy between letters that does not fit the cultural semantics of the encoded language.

-- 
Philippe.
Spam not tolerated: any unsolicited message will be
reported to your Internet service providers.


This archive was generated by hypermail 2.1.5 : Fri Aug 08 2003 - 10:52:26 EDT