Re: [unicode] Unihan database: kCangjie field

From: Charlie Ruland (ruland@luckymail.com)
Date: Tue Jun 16 2009 - 17:15:33 CDT

    I don’t know if the following is helpful:

    After I downloaded the 2008 version of the Cangjie 5 IME 第五代仓颉输入法
    (2008年最新版) from Malaysia’s Friends of Cangjie at
    http://www.chinesecj.com/newsoftware/index3.php?Type=1 and installed it
    on my WinXP machine, I found that the file cj5-win.MB had been copied to
    the C:\WINDOWS\system32 folder.

    This UTF-16LE-encoded file seems to contain all Cangjie codes that the
    IME makes use of in the following format:

    <code><ctrl1><char><ctrl2>

    where:
    <code> is the Cangjie 5 code (up to five Latin small letters a-z);
    <ctrl1> is a control character below U+0020;
    <char> is a single Han or other character (incl. Latin a-z), or a
    sequence* of Han characters;
    <ctrl2> is another control character below U+0020, but missing for the
    very last entry.

    The start after the file header is: <U+0061> <U+0001> <U+65E5> <U+0001>
    <U+0061> <U+0001> <U+66F0> <U+0002> ...

    *The IME supports input of words 詞語輸入 using four-letter codes. These
    Chinese words (i.e., character sequences), as well as letters,
    punctuation, symbols and the like, are of no significance to our purpose
    of mapping Cangjie codes to single Han characters.
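
    A minimal sketch, in Python, of reading the record layout described
    above. It assumes the file header has already been skipped (its length
    is not given here, so header_chars is only a placeholder), and the names
    RECORD, parse_records and load_table are illustrative, not part of the
    IME:

```python
import re

# One record: <code><ctrl1><char><ctrl2>.
# <ctrl2> is absent on the very last record, hence the "|$" alternative.
RECORD = re.compile(r"([a-z]{1,5})([\x00-\x1f])([^\x00-\x1f]+)([\x00-\x1f]|$)")


def parse_records(text):
    """Yield (code, chars) pairs from the already-decoded table body."""
    for m in RECORD.finditer(text):
        yield m.group(1), m.group(3)


def load_table(path, header_chars=0):
    """Decode a UTF-16LE table file and parse everything after the header.

    header_chars is an assumption: the real header length is unknown here.
    """
    with open(path, "rb") as f:
        text = f.read().decode("utf-16-le")
    return list(parse_records(text[header_chars:]))


# The first two records quoted above: a -> 日, a -> 曰
sample = "a\x01日\x01a\x01曰\x02"
print(list(parse_records(sample)))
```

    Run on the two records quoted above, this yields the pairs ('a', '日')
    and ('a', '曰'). The two control characters are matched but not
    interpreted, since their meaning is not known.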

    Please note that a single Han character may be mapped to several Cangjie
    codes due to glyph variation. Please also note that only Chinese glyph
    variants are taken into account, e.g. ‘禅’ is only mapped to ‘ifcwj’,
    not to ‘iffwj’ according to its standard Japanese form. It would of
    course be nice to have codes for non-Chinese glyph variants too.
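
    Given (code, character) pairs extracted from the file, a sketch of
    reducing them to a mapping from single Han characters to all of their
    codes might look as follows. The code point ranges are a rough
    assumption covering the main ideograph blocks, not an exhaustive test,
    and the sample data is made up for illustration: in particular the
    'iffwj' entry is there only to show how a second code for the same
    character would be collected, and is not in the IME's table.

```python
from collections import defaultdict

# Approximate Han ranges (assumption): Extension A, the main URO block,
# compatibility ideographs, and Extension B.
HAN_RANGES = [
    (0x3400, 0x4DBF),
    (0x4E00, 0x9FFF),
    (0xF900, 0xFAFF),
    (0x20000, 0x2A6DF),
]


def is_single_han(s):
    """True if s is exactly one character inside one of HAN_RANGES."""
    return len(s) == 1 and any(lo <= ord(s) <= hi for lo, hi in HAN_RANGES)


def codes_by_char(pairs):
    """Map each single Han character to the list of codes that produce it."""
    table = defaultdict(list)
    for code, chars in pairs:
        # Words (character sequences), letters and symbols are skipped.
        if is_single_han(chars) and code not in table[chars]:
            table[chars].append(code)
    return dict(table)


sample = [
    ("a", "日"),
    ("a", "曰"),
    ("a", "a"),         # hypothetical Latin entry, dropped
    ("abcd", "日曰"),    # hypothetical word entry, dropped
    ("ifcwj", "禅"),
    ("iffwj", "禅"),     # hypothetical second code for the same character
]
print(codes_by_char(sample))
```

    Codes are kept in a list per character, in file order, so a character
    with several glyph variants ends up with several codes.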

    Thanks, Edward, for your help,

    Charlie

    -------- Original Message --------
    Subject: Re: [unicode] Unihan database: kCangjie field
    From: Edward Cherlin <echerlin@gmail.com>
    To: John H. Jenkins <jenkins@apple.com>
    Date: Tue Jun 16 2009 09:07:42 GMT+0200
    > Here is a link for Cangjie 5 tables, 第五代倉頡字碼表. It is arranged in
    > "alphabetical" order of Cangjie codes, in 25 pages. (There is no
    > Cangjie code mapped to 'z'.)
    >
    > http://cbflabs.com/book/ocj5/ocj5/16.htm
    >
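    > Appendix 6
    >
    > Cangjie 5 Code Table (第五代倉頡字碼表)
    >
    > ───────────────────────────
    >
    > The following is the Cangjie 5 code table, arranged in letter order from the 日 section to the 卜 section. In the table, the first column is the Chinese character form; the second column, in slightly smaller type, is that form's code in Chinese; and the third column gives the corresponding English letters.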
    >
    > Once we get clear on the license, I can download all of this and put
    > it into a comma-delimited file. Someone else will have to fill in
    > characters and provide the Unicode mapping, since a lot of characters
    > are missing from these tables.
    >
    > On Mon, Jun 15, 2009 at 11:32 PM, Edward Cherlin<echerlin@gmail.com> wrote:
    >
    >> On Sun, Jun 14, 2009 at 6:45 PM, John H. Jenkins<jenkins@apple.com> wrote:
    >>
    >>> If someone is willing to do the work to contact these people, get their
    >>> permission, and write up a document for the UTC describing the data and
    >>> provide Richard Cook or me with the actual data, then I don't think that
    >>> there would be any real problem to adding it.
    >>>
    >> I'll write to them, and to Edouard Butler, author of Cangjie Method
    >> (in English), who works with Chu Bong-Foo, inventor of Cangjie.
    >>
    >>
    >>> Basically, here as elsewhere, the actual work involved is likely to be more
    >>> time-consuming than one thinks and neither Dr. Cook nor I have as much time
    >>> as we would like to devote to it. The best way to see that something makes
    >>> it into the Unihan database is to do the work of data collection for us.
    >>>
    >>> On Jun 15, 2009, at 1:57 AM, Charlie Ruland wrote:
    >>>
    >>>
    >>>> If it is true that the Unihan database has Cangjie v.3 input codes for
    >>>> only 29,148 characters, whereas Malaysia’s Friends of Cangjie have Cangjie
    >>>> v.5 codes for all CJK[V] unified ideographs of Unicode 4.0, why not add a
    >>>> “kCangjie5” field based on the more exhaustive data from Malaysia to the
    >>>> Unihan database (or, entirely replace the Cangjie v.3 data of the “kCangjie”
    >>>> field with the Cangjie v.5 data)?
    >>>>
    >>>> BTW, Malaysia’s Friends of Cangjie seem to be willing to have their data
    >>>> published: e.g., the English Wiktionary has the page
    >>>> http://en.wiktionary.org/wiki/Wiktionary:Chinese_Cangjie_index where it
    >>>> says: “Cāngjié data was taken from www.chinesecj.com with permission.”
    >>>>
    >>>> Charlie
    >>>>
    >>>> -------- Original Message --------
    >>>> Subject: Re: [unicode] Unihan database: kCangjie field
    >>>> From: mpsuzuki@hiroshima-u.ac.jp
    >>>> To: Charlie Ruland <ruland@luckymail.com>
    >>>> Date: Sun Jun 14 2009 07:30:59 GMT+0200
    >>>>
    >>>>> Hi,
    >>>>>
    >>>>> Checking the kCangjie entry for U+9762 (面) in Unihan.txt,
    >>>>> we can find this line:
    >>>>>
    >>>>> U+9762 kCangjie MWYL
    >>>>>
    >>>>> I guess, this is Cangjie version 3 style.
    >>>>> If it's version 5 style, it should be MWSL.
    >>>>>
    >>>>>
    >>>>> http://zh.wikipedia.org/wiki/%E5%80%89%E9%A0%A1%E8%BC%B8%E5%85%A5%E6%B3%95
    >>>>>
    >>>>> According to UTR#38, kCangjie field is based on Christian
    >>>>> Wittern's cangjie-table.b5.
    >>>>>
    >>>>>
    >>>>>
    >>>>>> Tag: kCangjie
    >>>>>> Status: Provisional
    >>>>>> Category: Dictionary-like Data
    >>>>>> Separator: space
    >>>>>> Syntax: [A-Z]+
    >>>>>> Description: The cangjie input code for the character.
    >>>>>> This incorporates data from the file cangjie-table.b5
    >>>>>> by Christian Wittern.
    >>>>>>
    >>>>>>
    >>>>> According to Christian Wittern's web site at Kyoto Univ.,
    >>>>> it seems that he has not updated cangjie-table.b5 since
    >>>>> 1993-Nov.
    >>>>>
    >>>>> http://kanji.zinbun.kyoto-u.ac.jp/~wittern/publications/data/index.html
    >>>>>
    >>>>>
    >>>>>> Cangjie Table: Table of all cangjie input keys,
    >>>>>> with radical / stroke and BIG5 code ,
    >>>>>> in: ftp://ifcss.org/software/data, November 1993.
    >>>>>>
    >>>>>>
    >>>>> I think the version of cangjie-table.b5 commonly used in
    >>>>> various free software is 1.02, released in May 1993,
    >>>>> e.g.
    >>>>> http://linenum.info/p/emacs/22.1/leim/MISC-DIC/cangjie-table.b5?page=1
    >>>>> http://linenum.info/p/emacs/22.1/leim/MISC-DIC/cangjie-table.b5?page=27
    >>>>> It includes 13059 entries, covering Big5 with the ETen extension.
    >>>>>
    >>>>> On the other hand, Unihan.txt 5.1.0 (2008-Mar-03) includes
    >>>>> 29148 entries. I don't know who added the extra kCangjie
    >>>>> entries to cover the characters that are not in Christian's
    >>>>> original cangjie-table.b5.
    >>>>>
    >>>>> Regards,
    >>>>> mpsuzuki
    >>>>>
    >>>>> On Sat, 13 Jun 2009 19:14:49 +0200
    >>>>> Charlie Ruland <ruland@luckymail.com> wrote:
    >>>>>
    >>>>>
    >>>>>
    >>>>>> Which version of Cangjie do the input codes given in the Unihan
    >>>>>> database follow?
    >>>>>> I couldn't find any explicit information on this in the Unicode Standard
    >>>>>> Annex #38: Unicode Han Database (Unihan) at
    >>>>>> http://www.unicode.org/reports/tr38/ .
    >>>>>> FYI, I use a Cangjie version 5 IME (第五代倉頡輸入法), designed by and
    >>>>>> downloaded from Malaysia’s Friends of Cangjie (倉頡之友。馬來西亞, at
    >>>>>> http://www.chinesecj.com/newsoftware/index3.php?Type=1 ), which promises
    >>>>>> to support input of some 70,000 characters.
    >>>>>> Are all Unihan kCangjie codes usable on my IME?
    >>>>>>
    >>>>>> Charlie
    >>>>>>
    >>>>>> --
    >>>>>> ___ Charlie Ruland ___ 冉書慧 ___
    >>>>>> ERROR__COMMVNIS__FACIT__IVS
    >>>>>>
    >>>>>>
    >>>>>>
    >>>>>>
    >>>>>
    >>>>>
    >>>> --
    >>>> — Charlie Ruland — 冉書慧 —
    >>>> ERROR·COMMVNIS·FACIT·IVS
    >>>>
    >>>>
    >>>>
    >>> =====
    >>> John H. Jenkins
    >>> jenkins@apple.com
    >>>
    >>
    >> --
    >> Silent Thunder (默雷/धर्ममेघशब्दगर्ज/دھرممیگھشبدگر ج) is my name
    >> And Children are my nation.
    >> The Cosmos is my dwelling place, The Truth my destination.
    >> http://earthtreasury.org/worknet (Edward Mokurai Cherlin)
    >>
    >>
    >
    >
    >
    >

    -- 
    — Charlie Ruland — 冉書慧 —
    ERROR·COMMVNIS·FACIT·IVS
    

