RE: Component Based Han Ideograph Encoding (WAS: Level of Unicode support required for various languages)

Date: Mon Oct 29 2007 - 15:36:41 CST

    Quoting Philippe Verdy:

    > wrote:
    >> When I have talked with Chinese publishers about IT difficulties the
    >> most common issue raised by far is how to add characters, the number
    >> of which would be reduced to almost zero if a composite model rather
    >> than a precomposed model were used.
    >> Stability rules about canonical equivalence may well be the biggest
    >> obstacle.
    > It is an obstacle only if considering the current encoding of IDS as a
    > graphical linear orthography, that has NO canonical equivalence with the
    > characters they "represent". In fact they don't represent them but describe
    > them weakly.
    > In order to build a composing model for Han, it would be required to
    > include only new IDS characters, these having a non-descriptive but
    > compositive property; in addition, it would be impossible (stability) to
    > redecompose the existing Han characters that are already singletons in both
    > NFC and NFD. It would even be impossible to decompose them using NFKC/NFKD.

    My apologies for the inexact terminology here - the essence of what I
    wished to say is as you say: a decompositional model that
    decomposes existing encoded characters would break the stability rules.
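
    The singleton point can be checked with a minimal Python sketch. It
    assumes the standard unicodedata module; the helper name
    is_normalization_singleton is hypothetical, and U+4E2D is only one
    sample unified ideograph, not a survey of the whole repertoire.

```python
import unicodedata

FORMS = ("NFC", "NFD", "NFKC", "NFKD")

def is_normalization_singleton(ch):
    """Return True if ch is left unchanged by all four normalization forms.

    Hypothetical helper for illustration only.
    """
    return all(unicodedata.normalize(form, ch) == ch for form in FORMS)

# U+4E2D is a CJK unified ideograph with no decomposition mapping.
han_sample = "\u4e2d"
# U+00E9 (e with acute) has a canonical decomposition, for contrast.
latin_sample = "\u00e9"

han_is_singleton = is_normalization_singleton(han_sample)
latin_is_singleton = is_normalization_singleton(latin_sample)
```

    Because the normalization forms are frozen for already encoded
    characters, a new decomposition cannot be attached to such an
    ideograph after the fact.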

    > So a completely new composition model would have to be adopted, distinct
    > from the one used with NFC/NFD. Certainly, most of the work already
    > performed with IDS/IDC could be kept to create this model, but for now, the
    > 20% that remain are not satisfyingly described, and that's a lot of work to
    > get something reliable.

    Much of the outstanding 20% can also be dealt with fairly quickly, but
    would need something other than IDS. Overall the component model would
    save time.

    In the research I am currently doing, the need is to catalogue and
    analyse a large number of texts, including an estimated 10 000
    unencoded characters. The aim is to automate the process as far as
    possible so that a number of researchers in different locations can input
    data at the same time; the automated processing works on a component
    model. The percentage of new characters that can be processed
    automatically will show the completeness or otherwise of such a model.
    This project has a few years to run yet. As to whether the model will
    be used outside of academic research I do not know.
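
    The completeness measure described above could be computed along the
    following lines. This is only a sketch under assumptions: the function
    name coverage, the data layout (a mapping from a character label to its
    list of components) and the placeholder labels are all hypothetical and
    do not come from the project itself.

```python
def coverage(decompositions, inventory):
    """Fraction of characters whose components are all in the inventory.

    decompositions: dict mapping a character label to a list of component
                    labels (an empty list means no decomposition was found).
    inventory:      set of component labels the model currently supports.
    """
    if not decompositions:
        return 0.0
    covered = sum(
        1
        for parts in decompositions.values()
        if parts and all(part in inventory for part in parts)
    )
    return covered / len(decompositions)

# Placeholder labels only; these are not real characters or real results.
sample_inventory = {"A", "B", "C"}
sample_decompositions = {
    "x": ["A", "B"],  # fully covered
    "y": ["A", "D"],  # component "D" is missing from the inventory
    "z": ["C"],       # fully covered
    "w": [],          # no decomposition available
}
sample_ratio = coverage(sample_decompositions, sample_inventory)
```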

    > The current approach, that attempts to compose IDS using additional numeric
    > positions for strokes, is not very suitable for creating a normalization;
    > there's some evidence that a more descriptive composition model could avoid
    > using this graphical positioning (i.e. without using x,y coordinates like it
    > is now, because it does not work with various ideographic font styles, and
    > these coordinates are not easily predictable).

    I assume that by the current approach you mean Wenlin's CDL, which is
    based on Cartesian co-ordinates. This is good for font making but bad
    for a component based model. As you say, the CDL is limited because it
    gives just one representation of a character. CJKV characters are not
    formed on a Cartesian system; the component based model should
    be based on the way characters are formed, and these concepts are more
    topological than Cartesian.
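
    To make the contrast with co-ordinates concrete, here is a minimal
    Python sketch that reads an Ideographic Description Sequence into a
    tree of relations between components. IDS notation is used purely as a
    familiar stand-in, not as the component model itself; the function
    names are hypothetical, and only the twelve IDCs U+2FF0..U+2FFB are
    handled.

```python
# IDCs taking two operands: U+2FF0, U+2FF1 and U+2FF4..U+2FFB.
BINARY_IDCS = set("\u2ff0\u2ff1\u2ff4\u2ff5\u2ff6\u2ff7\u2ff8\u2ff9\u2ffa\u2ffb")
# IDCs taking three operands: U+2FF2 and U+2FF3.
TERNARY_IDCS = set("\u2ff2\u2ff3")

def parse_ids(sequence):
    """Parse a prefix-notation IDS into nested (idc, [children]) tuples."""
    def parse(i):
        if i >= len(sequence):
            raise ValueError("unexpected end of sequence")
        ch = sequence[i]
        if ch in BINARY_IDCS:
            arity = 2
        elif ch in TERNARY_IDCS:
            arity = 3
        else:
            return ch, i + 1  # a plain component is a leaf
        children = []
        i += 1
        for _ in range(arity):
            child, i = parse(i)
            children.append(child)
        return (ch, children), i

    tree, end = parse(0)
    if end != len(sequence):
        raise ValueError("trailing characters after a complete sequence")
    return tree

def components(tree):
    """List the leaf components of a parsed tree, left to right."""
    if isinstance(tree, str):
        return [tree]
    leaves = []
    for child in tree[1]:
        leaves.extend(components(child))
    return leaves

# U+2FF0 (left-to-right) applied to U+6C35 and U+6BCF describes U+6D77.
sea_tree = parse_ids("\u2ff0\u6c35\u6bcf")
```

    The tree records only which component stands in which relation to
    which other component; no x,y positions appear anywhere, so the same
    description holds across font styles.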

    > This work should be completed, and studied with various styles, to see what
    > they have in common, and get a complete inventory of the accepted
    > variations, so that these variations can be modelled and simplified. The
    > IDC characters are just the start of this unfinished model. Maybe the
    > solution will be to add more IDC characters to encode the missing
    > distinctions (and then apply the external IDS normalization rules, enhanced
    > by these additional IDC's).

    IDCs were not designed to be used for a component model, though it is
    correct to say the current set of IDCs is incomplete. Also incomplete
    is the set of radicals encoded.

    Yours sincerely
    John Knightely

