RE: Component Based Han Ideograph Encoding (WAS: Level of Unicode support required for various languages)

From: vunzndi@vfemail.net
Date: Mon Oct 29 2007 - 15:36:41 CST


Quoting Philippe Verdy <verdy_p@wanadoo.fr>:

> vunzndi@vfemail.net wrote:
>> When I have talked with Chinese publishers about IT difficulties, the
>> most common issue raised by far is how to add characters, the number
>> of which would be reduced to almost zero if a composite model rather
>> than a precomposed model were used.
>>
>> Stability rules about canonical equivalence may well be the biggest
>> obstacle.
>
> It is an obstacle only if one considers the current encoding of IDS as a
> graphical linear orthography that has NO canonical equivalence with the
> characters they "represent". In fact they do not represent them but describe
> them weakly.
>
> In order to build a composing model for Han, it would be necessary to
> include only new IDS characters, ones having a compositive rather than a
> descriptive property; in addition, it would be impossible (stability) to
> redecompose the existing Han characters that are already singletons in both
> NFC and NFD. It would even be impossible to decompose them using NFKC/NFKD.
>

My apologies for the inexact terminology here. The essence of what I
wished to say is as you say: a decompositional model that
decomposes existing encoded characters would break stability rules.
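
As a rough illustration of the point about stability, here is a minimal
sketch using Python's standard unicodedata module. The helper name
is_unchanged_by_normalization and the sample character are my own choices
for illustration only. It checks that an encoded unified ideograph has no
decomposition in any of the four normalization forms, and that an IDS
describing it is not equivalent to it under normalization.

```python
import unicodedata

FORMS = ("NFC", "NFD", "NFKC", "NFKD")

def is_unchanged_by_normalization(text: str) -> bool:
    """Return True if all four normalization forms leave the text as it is."""
    return all(unicodedata.normalize(form, text) == text for form in FORMS)

han = "\u597d"              # an encoded unified ideograph (woman + child)
ids = "\u2ff0\u5973\u5b50"  # IDS: left-to-right IDC, then U+5973, then U+5B50

# The ideograph has no decomposition mapping at all.
print(unicodedata.decomposition(han) == "")

# Neither the ideograph nor the IDS is altered by normalization ...
print(is_unchanged_by_normalization(han))
print(is_unchanged_by_normalization(ids))

# ... so the IDS never normalizes to the character it describes.
print(unicodedata.normalize("NFC", ids) == han)
```

Any new compositional scheme would therefore have to live outside
NFC/NFD, since the existing mappings cannot be changed.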

> So a completely new composition model would have to be adopted, distinct
> from the one used with NFC/NFD. Certainly, most of the work already
> performed with IDS/IDC could be kept to create this model, but for now, the
> 20% that remain are not satisfactorily described, and that's a lot of work to
> get something reliable.
>

Much of the outstanding 20% can also be dealt with fairly quickly, but
would need something other than IDS. Overall the component model would
save time.

In my current research the need is to catalogue and analyse a large
number of texts, which include an estimated 10 000 unencoded characters.
The aim is to automate the process as far as possible, so that a number
of researchers in different locations can input data at the same time.
The automated processing works on a component model. The percentage of
new characters that can be processed automatically will show the
completeness or otherwise of such a model. This project has a few years
to run yet. As to whether the model will be used outside of academic
research, I do not know.
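
The completeness measure mentioned above can be sketched roughly as
follows. This is only an illustration under my own assumptions: the
function name coverage, the component inventory and the sample entries
are all hypothetical, and the real processing is considerably more
involved.

```python
def coverage(descriptions, known_components):
    """Percentage of characters whose components are all in the inventory."""
    if not descriptions:
        return 0.0
    processed = sum(
        1 for parts in descriptions.values()
        if set(parts) <= known_components
    )
    return 100.0 * processed / len(descriptions)

# Hypothetical component inventory and hypothetical unencoded characters,
# each described as a list of components.
known = {"\u5973", "\u5b50", "\u6728", "\u53e3"}
new_chars = {
    "char-001": ["\u5973", "\u5b50"],
    "char-002": ["\u6728", "\u53e3"],
    "char-003": ["\u6728", "unknown-component"],  # not in the inventory
    "char-004": ["\u53e3", "\u53e3"],
}

print(coverage(new_chars, known))  # 75.0
```

A figure well below 100 would point to gaps in the component inventory
or in the set of relations the model allows.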

> The current approach, which attempts to compose IDS using additional numeric
> positions for strokes, is not very suitable for creating a normalization.
> There is some evidence that a more descriptive composition model could avoid
> this graphical positioning (i.e. avoid using x,y coordinates as is done
> now, since they do not work across various ideographic font styles and
> are not easily predictable).
>

I assume that by "current approach" you mean Wenlin's CDL, which is
based on Cartesian coordinates. This is good for font making but bad
for a component-based model. As you say, CDL is limited because it
gives just one representation of a character. CJKV characters are not
formed on a Cartesian system; a component-based model should be based
on the way characters are formed, and these concepts are more
topological than Cartesian.
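
To make the contrast concrete, here is a small sketch of what a
topological description might look like as a data structure. The class
name Compound, the relation names and the helper leaves are hypothetical
and chosen only for this example; no coordinates appear anywhere, so the
same description holds whatever the font style.

```python
from dataclasses import dataclass
from typing import Tuple, Union

@dataclass(frozen=True)
class Compound:
    # Hypothetical relation names, e.g. "left-right", "above-below", "surround".
    relation: str
    parts: Tuple["Node", ...]

# A node is either a single component (a string) or a nested compound.
Node = Union[str, Compound]

def leaves(node):
    """List the basic components of a description, in reading order."""
    if isinstance(node, str):
        return [node]
    found = []
    for part in node.parts:
        found.extend(leaves(part))
    return found

# U+597D described as U+5973 to the left of U+5B50.
hao = Compound("left-right", ("\u5973", "\u5b50"))

# U+68EE described as U+6728 above two U+6728 placed side by side.
sen = Compound("above-below",
               ("\u6728", Compound("left-right", ("\u6728", "\u6728"))))

print(leaves(hao))
print(len(leaves(sen)))
```

Two descriptions compare equal when their relations and components
match, regardless of how any particular font draws them.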

> This work should be completed, and studied with various styles, to see what
> they have in common, and get a complete inventory of the accepted
> variations, so that these variations can be modelled and simplified. The
> IDC characters are just the start of this unfinished model. Maybe the
> solution will be to add more IDC characters to encode the missing
> distinctions (and then apply the external IDS normalization rules, enhanced
> by these additional IDC's).
>

IDCs were not designed to be used for a component model, though it is
correct to say that the current set of IDCs is incomplete. Also
incomplete is the set of encoded radicals.
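
For reference, the existing IDCs (U+2FF0 to U+2FFB) act as prefix
operators: U+2FF2 and U+2FF3 take three operands and the others take
two. The sketch below parses an IDS into a nested tuple on that basis.
The function name parse_ids is my own, and the sketch checks operand
counts only; it does not enforce any other restriction the standard
places on IDS.

```python
TERNARY = {"\u2ff2", "\u2ff3"}
BINARY = {chr(cp) for cp in range(0x2FF0, 0x2FFC)} - TERNARY

def parse_ids(sequence):
    """Parse a prefix-notation IDS into a nested tuple.

    Raises ValueError if operands are missing or characters are left over.
    """
    def parse(i):
        if i >= len(sequence):
            raise ValueError("unexpected end of sequence")
        ch = sequence[i]
        if ch in BINARY or ch in TERNARY:
            arity = 3 if ch in TERNARY else 2
            children = []
            i += 1
            for _ in range(arity):
                child, i = parse(i)
                children.append(child)
            return (ch, *children), i
        # Anything else is treated as a component (ideograph or radical).
        return ch, i + 1

    tree, end = parse(0)
    if end != len(sequence):
        raise ValueError("trailing characters after a complete sequence")
    return tree

print(parse_ids("\u2ff0\u5973\u5b50"))
print(parse_ids("\u2ff1\u6728\u2ff0\u6728\u6728"))
```

Relations that none of these twelve operators can express are exactly
where such a parser has nothing to offer, which is the incompleteness
referred to above.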

Yours sincerely
John Knightely




This archive was generated by hypermail 2.1.5 : Mon Oct 29 2007 - 15:40:29 CST