Brahmic list ? (was: Oriya: mba / mwa ?)

From: Philippe Verdy (verdy_p@wanadoo.fr)
Date: Sat Nov 29 2003 - 20:12:17 EST

  • Next message: Doug Ewell: "Re: Brahmic list ? (was: Oriya: mba / mwa ?)"

    Michael Everson writes:
    > Peter Constable wrote:
    >
    > > > I think the TDIL chart is wrong.
    > >
    > >It seems reasonable that one should need extra persuasion to take
    > >the word of an American living in Ireland over Indians. (Sorry.)

    Isn't there a specific list for Brahmic scripts? (brahmic@unicode.org ???).

    The number of issues with these scripts is about to explode if Indian
    sources start publishing new, undated references for their encoding and
    conversion to Unicode, including proposed changes to orthographic rules to
    better match either the phonology or the tradition, or to accommodate
    foreign terms.

    SIL.org is also working quite actively in this area, in connection with a
    proposed extended UTR22 reference for transcoding. But I'd like to see
    discussions about proposed UTR22 changes on the main Unicode list.

    There are not many issues with Thai, as it has long been standardized in
    TIS620, which was the basis of its Unicode encoding (though, regrettably,
    before UTR22 was produced, which would have allowed a better logical
    encoding that does not require lexical dictionaries to parse Thai text).
    Semantic analysis of Thai text is an interesting issue in itself, but not
    for deciding the correct way to encode Thai words (the TIS620 rules are
    clear, as the standard mostly encodes glyphs, expecting readers to
    interpret the written text using their knowledge of the language). So Thai
    discussions can remain on the main list.

    I also think that Tibetan issues should be discussed on that list, even
    though its composition model is very different from that of the Brahmic
    scripts of India, unless there is a specific rapporteur group for it.

    But not Han issues, which should probably be discussed on their own list,
    in connection with the IRG working group (which already produces its own
    technical reports as well as standardizing the extended repertoire).

    The recent issues I have read seem to multiply the number of Brahmic
    conjuncts we have to deal with, possibly in relation to new normalization
    forms (other than NFC and NFD). As with Hebrew, work on these scripts
    probably needs a separate discussion list, with the aim of producing a
    technical report in accordance with Indian sources. Other related
    Southeast Asian scripts should go there too: Lao, Khmer...

    My recent work with UCA and collation, as well as with UTR22 and the
    phonological analysis of many texts, leads me to promote the idea of new
    normalization forms in all areas where NFC/NFD or even NFKC/NFKD fall
    short (we can't change them because of the stability pact). In effect, UCA
    and collation in general already create a new coded character set, made of
    ordered collation weights belonging to separate ranges for each collation
    level, with those ranges sorted in the reverse order of the collation
    levels.
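
    To illustrate the point, here is a minimal sketch (with invented weights,
    not real DUCET values) of how per-level weight ranges, concatenated level
    by level, form a key that reads like a string in a new coded character set:

```python
# Illustrative sketch of UCA-style sort-key construction.
# The weight table below is invented for the example, not taken
# from the real DUCET; each level draws from a disjoint range.

# char -> (primary, secondary, tertiary)
WEIGHTS = {
    'a': (0x1000, 0x20, 0x2),
    'A': (0x1000, 0x20, 0x8),   # tertiary (case) difference only
    'á': (0x1000, 0x25, 0x2),   # secondary (accent) difference
    'b': (0x1001, 0x20, 0x2),
}
LEVEL_SEPARATOR = 0x0  # lower than any weight; terminates each level

def sort_key(s):
    """Concatenate the weights level by level, separated by 0,
    so earlier levels always outrank later ones."""
    key = []
    for level in range(3):
        for ch in s:
            w = WEIGHTS[ch][level]
            if w:
                key.append(w)
        key.append(LEVEL_SEPARATOR)
    return tuple(key)

# Base letter < case variant < accented variant < next primary:
assert sort_key('a') < sort_key('A') < sort_key('á') < sort_key('b')
```

    The resulting tuples compare correctly with ordinary lexicographic
    comparison, which is what makes the weight sequence behave like a
    character string in its own right.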

    I have experimented with a collation algorithm that implements UCA using
    the same mechanism as the UCD decompositions, but with added (and
    sometimes modified) decompositions. This system creates new "code points"
    needed only to represent <font> compatibility differences, ligatures, or
    alternate forms, by decomposing an existing compatibility character into
    more basic characters exposed with primary differences in UCA, plus these
    new characters, which are given "variable" collation weights and may be
    ignorable in applications that ignore the extra levels. This encoding uses
    a 31-bit code space, which is still highly compressible, yet still
    representable with the UTF-8 TES (even though the values are not Unicode
    code points) or a similar ad-hoc representation.
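
    Indeed, the original UTF-8 scheme (as defined in RFC 2279, with sequences
    of up to 6 bytes) can serialize any 31-bit value, whether or not it is a
    Unicode code point. A minimal sketch, assuming only that byte layout:

```python
# Sketch: serialize an arbitrary 31-bit value with the original
# UTF-8 byte scheme (RFC 2279, up to 6 bytes). Values above
# U+10FFFF are not Unicode code points, but the byte layout
# still works as a transfer encoding syntax.

def encode31(n):
    assert 0 <= n < (1 << 31)
    if n < 0x80:
        return bytes([n])          # single ASCII-range byte
    # Pick the sequence length from the value's magnitude.
    for nbytes, limit in ((2, 1 << 11), (3, 1 << 16),
                          (4, 1 << 21), (5, 1 << 26), (6, 1 << 31)):
        if n < limit:
            break
    lead_mask = (0xFF << (8 - nbytes)) & 0xFF  # e.g. 0xC0 for 2 bytes
    out = []
    for _ in range(nbytes - 1):
        out.append(0x80 | (n & 0x3F))          # 10xxxxxx continuation
        n >>= 6
    out.append(lead_mask | n)                  # leading byte
    return bytes(reversed(out))
```

    For values that happen to be Unicode code points the output matches
    standard UTF-8 (e.g. `encode31(0x20AC)` gives the usual three-byte euro
    sign), while the full 31-bit range tops out at six-byte sequences.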

    I am currently adapting this system to work with UTR22 transcodings, and
    I am testing it against the Brahmic scripts, Hebrew, and Latin. This is
    very promising, and my next step will be to handle the decomposition of
    Han characters into their component radicals and strokes. I do think it is
    possible to handle almost all UCA and UTR22 rules by using UTR22 itself
    plus decomposition rules in a simple table that nearly matches the format
    of the UCD.
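
    As a rough illustration of such a table-driven approach (the mappings
    below are invented for the example, not actual UCD data), recursive
    decomposition through a simple table looks like this:

```python
# Sketch of recursive decomposition through a UCD-style mapping
# table. The entries are illustrative; a real system would load
# them from the UCD or from added ad-hoc tables.
DECOMP = {
    'ﬁ': ['f', 'i'],           # ligature -> basic letters
    'é': ['e', '\u0301'],      # precomposed -> base + combining accent
}

def decompose(ch):
    """Apply the table recursively until only basic pieces remain."""
    if ch not in DECOMP:
        return [ch]
    out = []
    for part in DECOMP[ch]:
        out.extend(decompose(part))
    return out
```

    Collation weights (or transcoding rules) then only need to be assigned to
    the basic pieces, not to every precomposed form.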

    But all these discussions and encoding ambiguities around the Brahmic
    scripts are polluting my work. I am quite close to setting aside my
    current work on them until some agreement is found, notably within a
    revision of ISCII, if one is in preparation that will be more precise and
    will give more precise rules. For now it is impossible for me to adapt my
    model to the (sometimes contradictory) encoding solutions proposed by
    different people.






    This archive was generated by hypermail 2.1.5 : Sat Nov 29 2003 - 21:02:09 EST