Re: [unicode] Unihan database: kCangjie field

From: Charlie Ruland (ruland@luckymail.com)
Date: Tue Jun 16 2009 - 17:47:27 CDT

    Oh, it seems I’ve just found the complete list of my IME’s Cangjie 5 codes.
    Go to http://hyperrate.com/thread.php?tid=6172 and click on cj5.cin.bz2
    <http://cle.linux.org.tw/trac/attachment/wiki/GcinTables/cj5.cin.bz2?format=raw>
    to download.

    Regards,

    Charlie
    -------- Original Message --------
    Subject: Re: [unicode] Unihan database: kCangjie field
    From: Charlie Ruland <ruland@luckymail.com>
    To: Edward Cherlin <echerlin@gmail.com>
    Date: Wed Jun 17 2009 00:15:33 GMT+0200
    > I don’t know if the following is helpful:
    >
    > After downloading the 2008 version of the Cangjie 5 IME 第五代仓颉输入
    > 法 (2008年最新版) from Malaysia’s Friends of Cangjie at
    > http://www.chinesecj.com/newsoftware/index3.php?Type=1 and installing
    > it on my WinXP machine, the file cj5-win.MB was copied to the
    > C:\WINDOWS\system32 folder.
    >
    > This UTF-16LE-encoded file seems to contain all Cangjie codes that the
    > IME makes use of in the following format:
    >
    > <code><ctrl1><char><ctrl2>
    >
    > where:
    > <code> is the Cangjie 5 code (up to five Latin small letters a-z);
    > <ctrl1> is a control character below U+0020;
    > <char> is a single Han or other character (incl. Latin a-z), or a
    > sequence* of Han characters;
    > <ctrl2> is another control character below U+0020, but missing for the
    > very last entry.
    >
    > The start after the file header is: <U+0061> <U+0001> <U+65E5>
    > <U+0001> <U+0061> <U+0001> <U+66F0> <U+0002> ...
    >
    > *The IME supports input of words 詞語輸入 using four-letter codes.
    > These Chinese words (i.e., character sequences), as well as letters,
    > punctuation, symbols and the like, are of no significance to our
    > purpose of mapping Cangjie codes to single Han characters.
    >
    > Please note that a single Han character may be mapped to several
    > Cangjie codes due to glyph variation. Please also note that only
    > Chinese glyph variants are taken into account, e.g. ‘禅’ is only
    > mapped to ‘ifcwj’, not to ‘iffwj’ according to its standard Japanese
    > form. It would of course be nice to have codes for non-Chinese glyph
    > variants too.
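
The record layout described above can be sketched in Python as follows. This is a minimal sketch, not the IME's own reader: the names `RECORD`, `is_han` and `parse_records` are hypothetical, the Han ranges are a rough approximation, and the header of cj5-win.MB is not handled because its length is not given in this thread. The sample string reuses the record start quoted above and adds one made-up word entry.

```python
import re
from collections import defaultdict

# Assumption: each record is <code><ctrl1><char(s)><ctrl2>, where both
# separators are control characters below U+0020; <ctrl2> may be absent
# at the very end of the data.
RECORD = re.compile(r"([a-z]{1,5})[\x00-\x1f]([^\x00-\x1f]+)(?:[\x00-\x1f]|$)")

def is_han(ch):
    """Rough test for CJK unified ideographs (Ext. A, URO, planes 2 and 3)."""
    cp = ord(ch)
    return (0x3400 <= cp <= 0x4DBF
            or 0x4E00 <= cp <= 0x9FFF
            or 0x20000 <= cp <= 0x3FFFF)

def parse_records(text):
    """Map each single Han character to the list of codes that produce it."""
    table = defaultdict(list)
    for code, chars in RECORD.findall(text):
        # Skip words (character sequences), letters, punctuation and symbols.
        if len(chars) == 1 and is_han(chars):
            if code not in table[chars]:
                table[chars].append(code)
    return dict(table)

# Record start as quoted above, plus one made-up four-letter word entry.
sample = "a\x01日\x01a\x01曰\x02abcd\x01日曰\x01"
table = parse_records(sample)
```

Keeping a list of codes per character allows for the case mentioned above, where one character has several codes due to glyph variation. To try this on the real file, the bytes would first have to be decoded as UTF-16LE and the header skipped.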
    >
    > Thanks, Edward, for your help,
    >
    > Charlie
    >
    > -------- Original Message --------
    > Subject: Re: [unicode] Unihan database: kCangjie field
    > From: Edward Cherlin <echerlin@gmail.com>
    > To: John H. Jenkins <jenkins@apple.com>
    > Date: Tue Jun 16 2009 09:07:42 GMT+0200
    >> Here is a link for Cangjie 5 tables, 第五代倉頡字碼表. It is arranged in
    >> "alphabetical" order of Cangjie codes, in 25 pages. (There is no
    >> Cangjie code mapped to 'z'.)
    >>
    >> http://cbflabs.com/book/ocj5/ocj5/16.htm
    >>
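    >> Appendix 6
    >>
    >> Cangjie 5 Code Table (第五代倉頡字碼表)
    >>
    >> ───────────────────────────
    >>
    >> The following is the Cangjie 5 code table, arranged in alphabetical
    >> order from the 日 section to the 卜 section. In the table, the first
    >> column gives the Chinese character, the second column (in slightly
    >> smaller type) gives that character's code in Cangjie radicals, and the
    >> third column gives the corresponding Latin letters.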
    >>
    >> Once we get clear on the license, I can download all of this and put
    >> it into a comma-delimited file. Someone else will have to fill in
    >> characters and provide the Unicode mapping, since a lot of characters
    >> are missing from these tables.
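
A comma-delimited file with the Unicode mapping filled in could look like the output of the sketch below. The column names and the function `write_csv` are assumptions made for illustration, not an agreed format; the two sample pairs are codes mentioned elsewhere in this thread, and the codes are upper-cased only to match the `[A-Z]+` syntax of the existing kCangjie field.

```python
import csv
import io

def write_csv(rows, out):
    """Write (character, code) pairs as: code point, character, Cangjie code."""
    writer = csv.writer(out, lineterminator="\n")
    writer.writerow(["codepoint", "char", "cangjie5"])  # hypothetical header
    for char, code in rows:
        # Derive the U+XXXX notation from the character itself.
        writer.writerow([f"U+{ord(char):04X}", char, code.upper()])

# Sample pairs taken from this thread, not from the cbflabs tables.
rows = [("面", "mwsl"), ("禅", "ifcwj")]
buf = io.StringIO()
write_csv(rows, buf)
csv_text = buf.getvalue()
```

Deriving the code point from the character avoids typing the mapping by hand, but it only works once each table entry has been matched to an encoded character.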
    >>
    >> On Mon, Jun 15, 2009 at 11:32 PM, Edward Cherlin<echerlin@gmail.com>
    >> wrote:
    >>
    >>> On Sun, Jun 14, 2009 at 6:45 PM, John H. Jenkins<jenkins@apple.com>
    >>> wrote:
    >>>
    >>>> If someone is willing to do the work to contact these people, get their
    >>>> permission, and write up a document for the UTC describing the data and
    >>>> provide Richard Cook or me with the actual data, then I don't think that
    >>>> there would be any real problem to adding it.
    >>>>
    >>> I'll write to them, and to Edouard Butler, author of Cangjie Method
    >>> (in English), who works with Chu Bong-Foo, inventor of Cangjie.
    >>>
    >>>
    >>>> Basically, here as elsewhere, the actual work involved is likely to be more
    >>>> time-consuming than one thinks and neither Dr. Cook nor I have as much time
    >>>> as we would like to devote to it. The best way to see that something makes
    >>>> it into the Unihan database is to do the work of data collection for us.
    >>>>
    >>>> On Jun 15, 2009, at 1:57 AM, Charlie Ruland wrote:
    >>>>
    >>>>
    >>>>> If it is true that the Unihan database has Cangjie v.3 input codes for
    >>>>> only 29,148 characters, whereas Malaysia’s Friends of Cangjie have Cangjie
    >>>>> v.5 codes for all CJK[V] unified ideographs of Unicode 4.0, why not add a
    >>>>> “kCangjie5” field based on the more exhaustive data from Malaysia to the
    >>>>> Unihan database (or, entirely replace the Cangjie v.3 data of the “kCangjie”
    >>>>> field with the Cangjie v.5 data)?
    >>>>>
    >>>>> BTW, Malaysia’s Friends of Cangjie seem to be willing to have their data
    >>>>> published: e.g., the English Wiktionary has the page
    >>>>> http://en.wiktionary.org/wiki/Wiktionary:Chinese_Cangjie_index where it
    >>>>> says: “Cāngjié data was taken from www.chinesecj.com with permission.”
    >>>>>
    >>>>> Charlie
    >>>>>
    >>>>> -------- Original Message --------
    >>>>> Subject: Re: [unicode] Unihan database: kCangjie field
    >>>>> From: mpsuzuki@hiroshima-u.ac.jp
    >>>>> To: Charlie Ruland <ruland@luckymail.com>
    >>>>> Date: Sun Jun 14 2009 07:30:59 GMT+0200
    >>>>>
    >>>>>> Hi,
    >>>>>>
    >>>>>> Checking the kCangjie entry for U+9762 (面) in Unihan.txt,
    >>>>>> we can find this line:
    >>>>>>
    >>>>>> U+9762 kCangjie MWYL
    >>>>>>
    >>>>>> I guess, this is Cangjie version 3 style.
    >>>>>> If it's version 5 style, it should be MWSL.
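
The lookup described above can be reproduced with a short Python sketch. It assumes that Unihan data lines have three whitespace-separated fields (code point, tag, value) and that comment lines start with `#`; the function names are hypothetical, and the second sample line is included only to show that other tags are skipped. The version 5 code MWSL is the value suggested in this message, not something taken from the database.

```python
def parse_unihan_line(line):
    """Split one Unihan data line into (code point, tag, value), or None."""
    line = line.strip()
    if not line or line.startswith("#"):
        return None
    fields = line.split(None, 2)  # the value itself may contain spaces
    if len(fields) != 3 or not fields[0].startswith("U+"):
        return None
    return int(fields[0][2:], 16), fields[1], fields[2]

def kcangjie_codes(lines):
    """Collect the kCangjie value for each character found in the lines."""
    table = {}
    for line in lines:
        parsed = parse_unihan_line(line)
        if parsed and parsed[1] == "kCangjie":
            table[chr(parsed[0])] = parsed[2]
    return table

sample = [
    "# comment line",
    "U+9762\tkCangjie\tMWYL",
    "U+9762\tkTotalStrokes\t9",
]
codes = kcangjie_codes(sample)
# Compare with the version 5 code suggested in this message.
differs_from_v5 = codes["面"] != "MWSL"
```

Passing an open Unihan.txt file object instead of `sample` would give the full table, whose size can then be compared with the entry counts quoted in this thread.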
    >>>>>>
    >>>>>>
    >>>>>> http://zh.wikipedia.org/wiki/%E5%80%89%E9%A0%A1%E8%BC%B8%E5%85%A5%E6%B3%95
    >>>>>>
    >>>>>>
    >>>>>> According to UTR#38, kCangjie field is based on Christian
    >>>>>> Wittern's cangjie-table.b5.
    >>>>>>
    >>>>>>
    >>>>>>
    >>>>>>> Tag: kCangjie
    >>>>>>> Status: Provisional
    >>>>>>> Category: Dictionary-like Data
    >>>>>>> Separator: space
    >>>>>>> Syntax: [A-Z]+
    >>>>>>> Description: The cangjie input code for the character.
    >>>>>>> This incorporates data from the file cangjie-table.b5
    >>>>>>> by Christian Wittern.
    >>>>>>>
    >>>>>>>
    >>>>>> According to Christian Wittern's web site at Kyoto Univ.,
    >>>>>> it seems that he has not updated cangjie-table.b5 since
    >>>>>> 1993-Nov.
    >>>>>>
    >>>>>> http://kanji.zinbun.kyoto-u.ac.jp/~wittern/publications/data/index.html
    >>>>>>
    >>>>>>
    >>>>>>
    >>>>>>> Cangjie Table: Table of all cangjie input keys,
    >>>>>>> with radical / stroke and BIG5 code ,
    >>>>>>> in: ftp://ifcss.org/software/data, November 1993.
    >>>>>>>
    >>>>>>>
    >>>>>> I think the version of cangjie-table.b5 most commonly used in
    >>>>>> free software is 1.02, released in May 1993, e.g.
    >>>>>>
    >>>>>> http://linenum.info/p/emacs/22.1/leim/MISC-DIC/cangjie-table.b5?page=1
    >>>>>>
    >>>>>>
    >>>>>> http://linenum.info/p/emacs/22.1/leim/MISC-DIC/cangjie-table.b5?page=27
    >>>>>>
    >>>>>> It includes 13059 entries, covering Big5 with the ETen extension.
    >>>>>>
    >>>>>> On the other hand, Unihan.txt 5.1.0 (2008-Mar-03) includes
    >>>>>> 29148 entries. I don't know who added the extra kCangjie values
    >>>>>> for the characters that are not in Christian's original
    >>>>>> cangjie-table.b5.
    >>>>>>
    >>>>>> Regards,
    >>>>>> mpsuzuki
    >>>>>>
    >>>>>> On Sat, 13 Jun 2009 19:14:49 +0200
    >>>>>> Charlie Ruland <ruland@luckymail.com> wrote:
    >>>>>>
    >>>>>>
    >>>>>>
    >>>>>>> Which version of Cangjie do the input codes in the Unihan database
    >>>>>>> follow?
    >>>>>>> I couldn't find any explicit information on this in the Unicode Standard
    >>>>>>> Annex #38: Unicode Han Database (Unihan) at
    >>>>>>> http://www.unicode.org/reports/tr38/ .
    >>>>>>> FYI, I use a Cangjie version 5 IME (第五代倉頡輸入法) designed by and
    >>>>>>> downloaded from Malaysia’s Friends of Cangjie (倉頡之友。馬來西亞 at
    >>>>>>> http://www.chinesecj.com/newsoftware/index3.php?Type=1 ), which promises
    >>>>>>> to support input of some 70,000 characters.
    >>>>>>> Are all Unihan kCangjie codes usable on my IME?
    >>>>>>>
    >>>>>>> Charlie
    >>>>>>>
    >>>>>>> --
    >>>>>>> ___ Charlie Ruland ___ 冉書慧 ___
    >>>>>>> ERROR__COMMVNIS__FACIT__IVS
    >>>>> --
    >>>>> — Charlie Ruland — 冉書慧 —
    >>>>> ERROR·COMMVNIS·FACIT·IVS
    >>>>>
    >>>>>
    >>>>>
    >>>> =====
    >>>> John H. Jenkins
    >>>> jenkins@apple.com
    >>>
    >>> --
    >>> Silent Thunder (默雷/धर्ममेघशब्दगर्ज/دھرممیگھشبدگر ج) is my name
    >>> And Children are my nation.
    >>> The Cosmos is my dwelling place, The Truth my destination.
    >>> http://earthtreasury.org/worknet (Edward Mokurai Cherlin)

    -- 
    — Charlie Ruland — 冉書慧 —
    ERROR·COMMVNIS·FACIT·IVS
    


    This archive was generated by hypermail 2.1.5 : Tue Jun 16 2009 - 17:50:58 CDT