RE: Component Based Han Ideograph Encoding (WAS: Level of Unicode support required for various languages)

Date: Mon Oct 29 2007 - 15:36:41 CST

    Quoting Philippe Verdy:

    > wrote:
    >> When I have talked with Chinese publishers about IT difficulties, the
    >> most common issue raised by far is how to add characters, the number
    >> of which would be reduced to almost zero if a composite model rather
    >> than a precomposed model were used.
    >> Stability rules about canonical equivalence may well be the biggest
    >> obstacle.
    > It is an obstacle only if considering the current encoding of IDS as a
    > graphical linear orthography, that has NO canonical equivalence with the
    > characters they "represent". In fact they don't represent them but describe
    > them weakly.
    > In order to build a composing model for Han, it would be required to
    > only include new IDS characters, these ones having a non-descriptive but
    > compositive property; in addition, it would be impossible (stability) to
    > redecompose the existing Han characters that are already singleton in both
    > NFC and NFD. It would even be impossible to decompose them using NFKC/NFKD.

    My apologies for the inexact terminology here - the essence of what I
    wished to say is as you say: a decompositional model that
    decomposes existing encoded characters would break the stability rules.
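
    The stability point can be checked directly. What follows is only a
    minimal sketch using Python's standard unicodedata module; the helper
    name is_normalization_singleton and the choice of U+6797 with the IDS
    U+2FF0 U+6728 U+6728 are my own illustrative assumptions. It shows that
    an encoded Han character is unchanged by all four normalization forms,
    and that an IDS describing it is not canonically equivalent to it.

```python
import unicodedata

FORMS = ("NFC", "NFD", "NFKC", "NFKD")


def is_normalization_singleton(text: str) -> bool:
    """Return True if text is left unchanged by all four normalization forms."""
    return all(unicodedata.normalize(form, text) == text for form in FORMS)


han = "\u6797"              # an encoded CJK unified ideograph
ids = "\u2FF0\u6728\u6728"  # an IDS describing it: left-to-right IDC + two U+6728

# The encoded ideograph has no decomposition of any kind.
print(is_normalization_singleton(han))

# The IDS is also left alone: normalization never turns it into the ideograph.
print(unicodedata.normalize("NFC", ids) == han)

# Contrast with a Latin letter, where composition is canonical.
print(unicodedata.normalize("NFC", "e\u0301") == "\u00e9")
```

    The contrast with e + combining acute is the whole issue: that pair
    composes under NFC, whereas the IDS stays a separate sequence.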

    > So a completely new composition model would have to be adopted, distinct
    > from the one used with NFC/NFD. Certainly, most of the work already
    > performed with IDS/IDC could be kept to create this model, but for now, the
    > 20% that remain are not satisfyingly described, and that's a lot of work to
    > get something reliable.

    Much of the outstanding 20% can also be dealt with fairly quickly, but
    would need something other than IDS. Overall the component model would
    save time.

    In the research I am currently doing, the need is to catalogue and
    analyse a large number of texts, including an estimated 10 000
    unencoded characters. The aim is to automate the process as far as
    possible, so that a number of researchers in different locations can
    input data at the same time; the automated processing works on a
    component model. The percentage of new characters that can be processed
    automatically will show the completeness or otherwise of such a model.
    This project has a few years to run yet. As to whether the model will
    be used outside of academic research, I do not know.
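
    As a rough illustration of that completeness measure (a sketch only,
    not the project's actual tooling): assume each new character has been
    entered as a list of components, and that the model has a fixed
    inventory of known components. The function name coverage, the
    placeholder keys and the sample data are all hypothetical.

```python
def coverage(descriptions, known_components):
    """Fraction of characters whose components are all in the inventory."""
    if not descriptions:
        return 0.0
    handled = sum(
        1
        for parts in descriptions.values()
        if all(part in known_components for part in parts)
    )
    return handled / len(descriptions)


# Hypothetical inventory and hypothetical entries, for illustration only.
known = {"\u6728", "\u6C35", "\u65E5", "\u6708"}
sample = {
    "char_a": ["\u6728", "\u6728"],
    "char_b": ["\u6C35", "\u65E5"],
    "char_c": ["\u6708", "?"],  # uses a component missing from the inventory
    "char_d": ["\u65E5", "\u6708"],
}

print(f"{coverage(sample, known):.0%} of the sample can be processed automatically")
```

    The characters that fall outside the inventory are the interesting
    ones, since they show where the component set is incomplete.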

    > The current approach, that attempts to compose IDS using additional numeric
    > positions for strokes is not very suitable for creating a normalization;
    > there's some evidence that a more descriptive composition model could avoid
    > using this graphical positioning (i.e. without using x,y coordinates like it
    > is now, because it does not work with various ideographic font styles, and
    > these coordinates are not easily predictable).

    I assume that by the current approach you mean Wenlin's CDL, which is
    based on cartesian co-ordinates. This is good for font making but bad
    for a component based model. As you say, the CDL is limited because it
    gives just one representation of a character. CJKV characters are not
    formed on the basis of a cartesian system; the component based model
    should be based on the way characters are formed, and these concepts are
    more topological than cartesian.
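
    To make "topological rather than cartesian" concrete, here is a small
    sketch that reads a description into a tree of relations and components
    with no coordinates anywhere. It borrows the twelve IDCs U+2FF0..U+2FFB
    as relation markers purely for illustration (a real component model
    would need more than these); the names Node, parse_ids and IDC_ARITY are
    my own.

```python
from dataclasses import dataclass
from typing import Tuple, Union

# Operand counts for the twelve IDCs U+2FF0..U+2FFB:
# U+2FF2 and U+2FF3 take three operands, the others take two.
IDC_ARITY = {chr(cp): 2 for cp in range(0x2FF0, 0x2FFC)}
IDC_ARITY["\u2FF2"] = 3
IDC_ARITY["\u2FF3"] = 3


@dataclass(frozen=True)
class Node:
    relation: str                # the IDC naming the arrangement
    parts: Tuple["Tree", ...]    # components, each a character or a sub-tree


Tree = Union[str, Node]


def _parse(ids: str, pos: int):
    """Parse one description starting at pos; return (tree, next position)."""
    if pos >= len(ids):
        raise ValueError("description ends before all operands are supplied")
    ch = ids[pos]
    arity = IDC_ARITY.get(ch)
    if arity is None:
        return ch, pos + 1       # a plain component
    pos += 1
    parts = []
    for _ in range(arity):
        part, pos = _parse(ids, pos)
        parts.append(part)
    return Node(ch, tuple(parts)), pos


def parse_ids(ids: str) -> Tree:
    """Parse a complete prefix-notation description into a tree."""
    tree, end = _parse(ids, 0)
    if end != len(ids):
        raise ValueError("trailing characters after a complete description")
    return tree


# U+2FF1 (above/below) applied to U+6728 and a nested U+2FF0 (left/right) pair.
tree = parse_ids("\u2FF1\u6728\u2FF0\u6728\u6728")
print(tree)
```

    The tree records only which component stands in which relation to
    which, so the same description holds across font styles.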

    > This work should be completed, and studied with various styles, to see what
    > they have in common, and get a complete inventory of the accepted
    > variations, so that these variations can be modelled and simplified. The
    > IDC characters are just the start of this unfinished model. Maybe the
    > solution will be to add more IDC characters to encode the missing
    > distinctions (and then apply the external IDS normalization rules, enhanced
    > by these additional IDC's).

    IDCs were not designed to be used for a component model, though it is
    correct to say that the current set of IDCs is incomplete. Also incomplete
    is the set of radicals encoded.

    Yours sincerely
    John Knightely


