Re: Level of Unicode support required for CJKV

From: vunzndi@vfemail.net
Date: Sat Oct 27 2007 - 03:26:57 CDT


    Quoting James Kass <thunder-bird@earthlink.net>:

    >
    > John Knightley wrote,
    >
    >>> The difference and similarity between radicals 72 and 73 are
    >>> reflected as Unification Pattern No. 68 on this beta page:
    >>> http://kanji-database.sourceforge.net/housetsu.html
    >>
    >> The page is a beta page and not mature. Flag/pattern No. 68 is one
    >> that is IMHO wrong; pattern 68 will probably be deprecated or removed
    >> in the future.
    >
    > In addition to noting that this is a beta page, we also note that
    > flag/pattern isn't a rule. It's only a flag/marker/pattern.
    >
    > (It is my understanding that) these flags are generated by
    > machine with the intent that anything flagged be checked
    > by a human being.
    >
    > Because radicals 72 and 73 have the same essential shape and
    > are confusable, and because IDS accompanying proposed new
    > characters may come from various sources, I think it is a
    > good flag/pattern. Even though most everything flagged
    > under pattern number 68 would not be unifiable, it might
    > catch a duplicate submission which would otherwise be missed
    > until it is too late.
    >

    Point taken. At some point, hopefully in the not too distant future,
    as a result of the process of reviewing Annex S, it should be possible
    to say, for many of the flags on the beta page mentioned, which mark
    unifiable components and which mark components where a human check is
    required. When this happens, flag 68 will definitely be in the latter
    group, not the former.
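
    A minimal sketch of what a "human check required" flag might look
    like in practice, assuming IDS strings are compared after mapping
    confusable components to one representative. The function names and
    the sample data are hypothetical; only the radical 72/73 pair comes
    from this thread, and this is not how the beta page is actually
    generated.

```python
# Groups of components that look alike but are NOT unifiable.
# Only the radical 72/73 pair is taken from the discussion.
CONFUSABLE_GROUPS = [
    {"\u65e5", "\u66f0"},  # radical 72 (日) and radical 73 (曰)
]


def normalize(ids: str) -> str:
    """Map every confusable component to one representative of its group."""
    table = {}
    for group in CONFUSABLE_GROUPS:
        representative = min(group)
        for component in group:
            table[ord(component)] = representative
    return ids.translate(table)


def flag_for_review(submitted_ids: str, existing: dict) -> list:
    """Return existing characters whose IDS matches the submission once
    confusable components are folded together. A match is only a prompt
    for a human check, not a unification decision."""
    key = normalize(submitted_ids)
    return [char for char, ids in existing.items() if normalize(ids) == key]


# Illustrative data: one encoded character and its IDS.
existing = {"明": "⿰日月"}
print(flag_for_review("⿰曰月", existing))  # same layout, radical 73 instead of 72
print(flag_for_review("⿱曰月", existing))  # different layout, nothing to check
```

    Anything returned by flag_for_review would still need to be looked
    at by a person, since most such matches would not be unifiable.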

    > But, of course, you are right in saying that radical 72 and
    > radical 73 aren't unifiable.
    >
    > Because of my approach, I'm inclined to think that where two
    > separate Unicode characters could be printed using the same
    > piece of metal type, those characters would be interchangeable.
    > If someone hands you a small piece of paper with a single CJK
    > character hand-written on it and asks you for the Unicode
    > number for that character, it should be possible to give an
    > unambiguous answer. When someone is using a radical/stroke
    > look-up utility to find a certain character, they would tend
    > to stop as soon as they found a character identical in appearance
    > with the one sought.
    >

    I have to admit that similar questions about unification have been
    raised by those involved in producing bitmap/raster fonts for CJKV.

    > There's also the issue of optical character recognition software
    > which must deal with these confusables. If the O.C.R. software
    > finds a visual exact match and presents it for review to the
    > person initializing the software, it's going to look on-screen
    > exactly like it looked on the scanned original. So how would
    > this person know whether the character selected by the
    > software was correct? A sophisticated O.C.R. system might
    > anticipate this and present all confusables in a fashion which
    > would enable the user to select the appropriate character,
    > I suppose.
    >

    CJKV OCR software, like much other OCR software, tends to include a
    dictionary-type database to help decide from context what a character
    might be. If one takes a non-Putonghua CJKV text and puts it through
    a Putonghua OCR, the number of misread characters is very large.
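
    A rough sketch of the dictionary idea, assuming the recognizer
    returns a ranked list of look-alike candidates for each position and
    that the lexicon is a plain set of known words. The function name
    and the tiny lexicons are hypothetical and do not describe any
    particular OCR product.

```python
from itertools import product


def pick_by_lexicon(candidates, lexicon):
    """candidates: one list of possible characters per position, in the
    recognizer's order of preference.
    Return the first combination found in the lexicon; if none is found,
    fall back to the recognizer's first choice at each position."""
    for combo in product(*candidates):
        word = "".join(combo)
        if word in lexicon:
            return word
    return "".join(choices[0] for choices in candidates)


# The recognizer cannot tell radical 73 (曰) from radical 72 (日) by shape.
scanned = [["曰", "日"], ["本"]]

# A lexicon that covers the text resolves the ambiguity from context.
print(pick_by_lexicon(scanned, {"日本"}))

# A lexicon built for other material does not help, so the shape-only
# guess goes through unchanged.
print(pick_by_lexicon(scanned, {"子曰"}))
```

    The second call shows why a mismatched dictionary leads to many
    misreads: with no matching entry, context cannot correct the guess.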

    Regards
    John

    > Best regards,
    >
    > James Kass
    >
    >
    >



