Date: Mon Oct 29 2007 - 15:36:41 CST
Quoting Philippe Verdy <email@example.com>:
> firstname.lastname@example.org wrote:
>> When I have talked with Chinese publishers about IT difficulties the
>> most common issue raised by far is how to add characters, the number
>> of which would be reduced to almost zero if a composite model rather
>> than a precomposed model was used.
>> Stability rules about canonical equivalence may well be the biggest
> It is an obstacle only if considering the current encoding of IDS as a
> graphical linear orthography, that has NO canonical equivalence with the
> characters they "represent". In fact they don't represent them but describe
> them weakly.
> In order to build a composing model for Han, it would be required to
> only include new IDS characters, these ones having a non-descriptive but
> compositive property; in addition, it would be impossible (stability) to
> redecompose the existing Han characters that are already singleton in both
> NFC and NFD. It would even be impossible to decompose them using NFKC/NFKD.
My apologies for the inexact terminology here. The essence of what I
wished to say is as you say: a decompositional model that
decomposes existing encoded characters would break the stability rules.
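
To make the stability point concrete, here is a minimal sketch using
Python's standard unicodedata module. The character 好 (U+597D) and
the IDS ⿰女子 are examples chosen purely for illustration, and the
helper names are hypothetical. It only shows that an encoded unified
ideograph has no decomposition and that normalization never turns an
IDS into the character it describes.

```python
import unicodedata

FORMS = ("NFC", "NFD", "NFKC", "NFKD")

HAN = "\u597d"              # 好, an encoded unified ideograph (illustrative choice)
IDS = "\u2ff0\u5973\u5b50"  # ⿰女子, an IDS that describes 好


def is_normalization_stable(text):
    """Return True if text is left unchanged by all four normalization forms."""
    return all(unicodedata.normalize(form, text) == text for form in FORMS)


def normalizes_to(sequence, target):
    """Return True if any normalization form maps sequence onto target."""
    return any(unicodedata.normalize(form, sequence) == target for form in FORMS)


if __name__ == "__main__":
    # No decomposition mapping is recorded for the ideograph.
    print(repr(unicodedata.decomposition(HAN)))
    # The ideograph is unchanged by NFC, NFD, NFKC and NFKD.
    print(is_normalization_stable(HAN))
    # The IDS stays a plain sequence; it is not equivalent to the ideograph.
    print(normalizes_to(IDS, HAN))
```

Any composition model for Han would therefore have to sit outside
NFC/NFD, as Philippe says below.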
> So a completely new composition model would have to be adopted, distinct
> from the one used with NFC/NFD. Certainly, most of the work already
> performed with IDS/IDC could be kept to create this model, but for now, the
> 20% that remain are not satisfyingly described, and that's a lot of work to
> get something reliable.
Much of the outstanding 20% can also be dealt with fairly quickly, but
would need something other than IDS. Overall the component model would
In the research I am currently doing, the need is to catalogue and
analyse a large number of texts, including an estimated 10 000
unencoded characters. The aim is to automate the process as far as
possible so that a number of researchers in different locations can
input data at the same time. The automated processing works on a
component model. The percentage of new characters that can be processed
automatically will show the completeness or otherwise of such a model.
This project has a few years to run yet. As to whether the model will
be used outside of academic research I do not know.
> The current approach, that attempts to compose IDS using additional numeric
> positions for strokes is not very suitable for creating a normalization,
> there's some evidence that a more descriptive composition model could avoid
> using this graphical positional (i.e. without using x,y coordinates like it
> is now, because it does not work with various ideographic font styles, and
> these coordinates are not easily predictable).
I assume that by the current approach you mean Wenlin's CDL, which is
based on Cartesian co-ordinates. This is good for font making but bad
for a component-based model. As you say, CDL is limited because it
gives just one representation of a character. CJKV characters are not
formed on a Cartesian system; a component-based model should
be based on the way characters are formed, and these concepts are more
topological than Cartesian.
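
As a rough illustration of "topological rather than Cartesian", the
sketch below parses an IDS into a nested tree that records only which
components stand in which relation, with no x,y positions. IDS is used
here purely as a stand-in notation, not as the component model I am
working with. The function name is hypothetical, and the arity table
covers only the IDCs U+2FF0..U+2FFB.

```python
# Arity of the Ideographic Description Characters U+2FF0..U+2FFB.
# All are binary except U+2FF2 and U+2FF3, which take three operands.
IDC_ARITY = {chr(cp): 2 for cp in range(0x2FF0, 0x2FFC)}
IDC_ARITY["\u2ff2"] = 3
IDC_ARITY["\u2ff3"] = 3


def parse_ids(ids):
    """Parse a prefix-notation IDS into a nested tuple (operator, operands...).

    A bare component is returned as a one-character string.
    Raises ValueError if the sequence is truncated or has trailing characters.
    """

    def parse(index):
        if index >= len(ids):
            raise ValueError("truncated IDS")
        ch = ids[index]
        arity = IDC_ARITY.get(ch)
        if arity is None:
            # Not an operator: treat it as a leaf component.
            return ch, index + 1
        operands = []
        position = index + 1
        for _ in range(arity):
            operand, position = parse(position)
            operands.append(operand)
        return (ch, *operands), position

    tree, end = parse(0)
    if end != len(ids):
        raise ValueError("trailing characters after a complete IDS")
    return tree


if __name__ == "__main__":
    # ⿱艹⿰日月: an "above/below" relation whose lower part is "left/right".
    print(parse_ids("\u2ff1\u8279\u2ff0\u65e5\u6708"))
```

The tree says nothing about stroke positions or font style, which is
the property a component model needs and a co-ordinate description
lacks.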
> This work should be completed, and studied with various styles, to see what
> they have in common, and get a complete inventory of the accepted
> variations, so that these variations can be modelled and simplified. The
> IDC characters are just the start of this unfinished model. Maybe the
> solution will be to add more IDC characters to encode the missing
> distinctions (and then apply the external IDS normalization rules, enhanced
> by these additional IDC's).
IDCs were not designed to be used for a component model, though it is
correct to say that the current set of IDCs is incomplete. Also
incomplete is the set of encoded radicals.