Re: writing Chinese dialects

From: vunzndi@vfemail.net
Date: Sun Feb 04 2007 - 18:09:19 CST

    Dear Arne,

    I would certainly welcome help putting the data into standard IDS
    format. The file is exported from a database of mine that uses a
    format similar to IDS (close enough for a fuzzy search as described
    below). I do have a more recent version, which I think is too big for
    the mailing list, so I will send it to you separately. Briefly, the
    ideas are:
        1. ? and ?? mark a missing or uncertain character/data (similar to
           ids_irg.txt, where ? usually denotes a missing character)
        2. +, - and brackets, with the obvious usage
        3. A+B combinations, as opposed to Mr Taichi Kawabata's Polish
           (prefix) +AB ordering
        4. A-B permitted where the part/radical is not in Unicode

    It would be fair to say that only the fourth option, allowing A-B, is
    particularly useful; in other respects Mr Taichi Kawabata's system is
    much better for doing sophisticated searches where the IDS are
    flattened, that is, broken down into parts before searching.
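
    As a rough sketch of the difference between the exact and the "fuzzy"
    grep quoted below: the three sample lines and the function names
    exact_search and fuzzy_search are made up for illustration and do not
    reproduce the real ids_irg.txt layout.

```shell
# Hypothetical sample data; real ids_irg.txt lines look different.
sample='X1 AB
X2 A(CB)
X3 CD'

# Exact search: the two parts must be adjacent and in this order.
# -F treats the parts as fixed strings, so +, - and brackets stay literal.
exact_search() { grep -F -- "$1$2"; }

# "Fuzzy" search: both parts occur somewhere on the line, in any order.
fuzzy_search() { grep -F -- "$1" | grep -F -- "$2"; }

printf '%s\n' "$sample" | exact_search A B   # matches X1 only
printf '%s\n' "$sample" | fuzzy_search A B   # matches X1 and X2
```

    The fuzzy form still finds X2, where a bracket and another part sit
    between A and B, which is why it tolerates a format that is only
    similar to IDS.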

    A straight substitution leaves the order incorrect, so I left
    the data with its +, - and brackets so that it would be obvious
    that there was a difference. I was planning to reorder after doing
    one last check of the data.
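
    For the simplest case only, the reordering could look something like
    the sketch below. It assumes a single + between two one-character
    parts and no brackets; to_prefix is a hypothetical helper name, and
    anything nested would need a proper parser rather than sed.

```shell
# Move a single infix "+" in front of its two operands: A+B -> +AB.
# Assumes one-character operands and no brackets. With real CJK parts,
# "." only matches a whole character in a UTF-8 locale.
to_prefix() { sed -E 's/^(.)\+(.)$/+\1\2/'; }

echo 'A+B' | to_prefix       # prints +AB
echo '(A+B)+C' | to_prefix   # left unchanged: nested input is not handled
```

    Leaving bracketed entries untouched keeps them easy to spot for
    reordering by hand.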

    John

    Quoting "Arne<arne@linux.org.tw>:

    > On Sunday 04 February 2007 23:53, vunzndi@vfemail.net wrote:
    >> For Extension B the best is Mr Taichi Kawabata's ids_irg.txt, which
    >> includes all the CJKV characters presently in Unicode, at
    >>
    >> <http://www.cse.cuhk.edu.hk/~irg/irg/irg25/IRGN1183A_ids_irg.txt.gz>
    >>
    >> I usually just grep it, sometimes
    >>
    >> $ grep AB ids_irg.txt
    >>
    >> but more often the "fuzzy"
    >>
    >> $ grep A ids_irg.txt | grep B
    >>
    >>
    >> For the very much smaller Extension C, which has still to be fully
    >> passed, there is my "very much a work in progress"
    >> ExtensionC_decomposed.txt, which gives only the IRG numbers since the
    >> characters are not yet official. I hope to update this very soon. For
    >> this please go to
    >> http://east-chr-data.cvs.sourceforge.net/east-chr-data/ExtensionC/data/tables/ExtensionC_decomposed.txt?view=log
    >> and download the latest version.
    >>
    >> According to this, at least 7 characters from your missing list are
    >> apparently in Extension C (file attached).
    >>
    >> John Knightley
    >
    > Thanks very much, both of you. I think this will help a lot for finding
    > more "missing" characters... :)
    >
    > John, may I help you to update your Ext. C file to use the "correct" IDS
    > instead of "/" and "+" ? ;) I would send you a diff then...
    >
    > Cheers
    > Arne
    > --
    > Arne G
