RE: Multiple script Handling (kanji - kana)

From: Marco Cimarosti (marco.cimarosti@essetre.it)
Date: Fri Jan 25 2002 - 06:06:30 EST


Berthold Frommann wrote:
> 生物 = 1. seibutsu, 2. namamono
> 今日 = 1. konnichi, 2. kyou
> 上手 = 1. jouzu, 2. uwate, 3. kamite

Do you have a rough estimate of how many compound words have multiple
readings?

> Readings of place names and personal names are especially difficult to
> figure out.
> However, it really depends on what kind of data you are about to process.
> If you e.g. have two fields for a Japanese person's name, one in kanji,
> and the transcription in kana, you could at least check whether it is
> among the correct transcriptions of the name ... (sigh!)

Another possibility could be to process furigana (or "ruby"), if present in
the text.

Furigana are small kana characters (normally hiragana) written above
difficult kanji, or beside them in vertical text, to show their
pronunciation. What counts as a "difficult kanji" ranges from any kanji at
all (in children's books) to only the most obscure proper names.

In plain-text Unicode, furigana might be encoded using this set of control
characters:

        U+FFF9 (INTERLINEAR ANNOTATION ANCHOR)
        U+FFFA (INTERLINEAR ANNOTATION SEPARATOR)
        U+FFFB (INTERLINEAR ANNOTATION TERMINATOR)

The format of a word with furigana should be:

        U+FFF9 kanji(s) U+FFFA hiragana(s) U+FFFB

The matching program should simply keep the hiragana and discard the rest
of the sequence, i.e. the three control characters and the kanji.
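
As a rough illustration of that step, here is a minimal Python sketch. The
function name keep_readings and the sample string are made up for this
example, and it assumes each annotated run has exactly one separator and
that annotations are not nested. Real text may not follow those rules.

```python
import re

# The three interlinear annotation controls listed above.
ANCHOR = "\uFFF9"
SEPARATOR = "\uFFFA"
TERMINATOR = "\uFFFB"

# Any run of characters that contains none of the three controls.
_PLAIN = "([^\uFFF9\uFFFA\uFFFB]*)"

# anchor, base text (kanji), separator, annotation (kana), terminator
_ANNOTATED = re.compile(ANCHOR + _PLAIN + SEPARATOR + _PLAIN + TERMINATOR)


def keep_readings(text: str) -> str:
    """Replace every annotated run with its annotation (the furigana).

    Hypothetical helper: assumes one separator per run and no nesting.
    Text outside annotated runs is passed through unchanged.
    """
    return _ANNOTATED.sub(lambda m: m.group(2), text)


sample = "\uFFF9今日\uFFFAきょう\uFFFBは\uFFF9上手\uFFFAじょうず\uFFFBです"
print(keep_readings(sample))  # きょうはじょうずです
```

Kanji that carry no furigana are left as they are, so the ambiguity
discussed above remains for those words.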

If the text is not plain-text, there may be other ways of indicating
furigana. E.g., HTML has its own markup for ruby.
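
For the HTML case, a comparable sketch using Python's standard
html.parser module might look like the following. The class and function
names are invented for this example, and it assumes the ruby, rt and rp
elements all have explicit closing tags (HTML allows some of them to be
omitted, which this sketch does not handle).

```python
from html.parser import HTMLParser


class RubyReadingExtractor(HTMLParser):
    """Collect text, keeping only the <rt> part of each <ruby> element.

    Hypothetical helper: assumes explicit closing tags and no nested ruby.
    """

    def __init__(self):
        super().__init__()
        self.parts = []       # pieces of output text, in document order
        self.in_ruby = False  # currently inside <ruby> ... </ruby>
        self.in_rt = False    # currently inside <rt> ... </rt>

    def handle_starttag(self, tag, attrs):
        if tag == "ruby":
            self.in_ruby = True
        elif tag == "rt":
            self.in_rt = True

    def handle_endtag(self, tag):
        if tag == "rt":
            self.in_rt = False
        elif tag == "ruby":
            self.in_ruby = False
            self.in_rt = False

    def handle_data(self, data):
        # Outside ruby: keep everything. Inside ruby: keep only the reading,
        # which drops the base kanji and any <rp> fallback parentheses.
        if not self.in_ruby or self.in_rt:
            self.parts.append(data)


def readings_from_html(html: str) -> str:
    parser = RubyReadingExtractor()
    parser.feed(html)
    parser.close()
    return "".join(parser.parts)


html = "<p><ruby>今日<rp>(</rp><rt>きょう</rt><rp>)</rp></ruby>は</p>"
print(readings_from_html(html))  # きょうは
```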

_ Marco


