From: Philippe Verdy (email@example.com)
Date: Sat Nov 29 2003 - 20:12:17 EST
Michael Everson writes:
> Peter Constable wrote:
> > > I think the TDIL chart is wrong.
> >It seems reasonable that one should need extra persuasion to take
> >the word of an American living in Ireland over Indians. (Sorry.)
Isn't there a specific list for Brahmic scripts? (firstname.lastname@example.org ???).
The number of issues with these scripts is about to explode if Indian
sources start publishing new undated references for their encoding and
conversion to Unicode, including proposed changes of orthographic rules to
better match either the phonology or the tradition or the inclusion of
SIL.org is also working quite actively in this area, in connection with a
proposed extended UTR22 reference for transcoding. But I'd like to see
discussions about proposed UTR22 changes on the main Unicode list.
There are not many issues with Thai, as it has long been standardized in
TIS620, which was the basis of the Unicode encoding (unfortunately before
UTR22 was produced, which would have allowed a better logical encoding
without needing lexical dictionaries to parse Thai text). Semantic analysis
of Thai text is an interesting issue in itself, but it does not affect the
correct way to encode Thai words (the TIS620 rules are clear, as it mostly
encodes glyphs, expecting that readers will interpret the written text using
their knowledge of the language). So Thai discussions can remain on the main
list.
I also think that Tibetan issues should be discussed on that list, even
though its composition model is very different from that of the Brahmic
scripts of India, unless there's a specific rapporteur group for it.
But not Han issues, which should possibly be discussed on their own list in
connection with the IRG working group (which already works on its own
technical reports as well as on the standardization of the extended
repertoire).
The recent issues I have read seem to multiply the number of Brahmic
conjuncts we have to deal with, possibly in relation to new normalization
forms (other than NFC and NFD). As with Hebrew, there's probably a need for
work on these scripts in a separate discussion list, with the aim of
producing a technical report in accordance with Indian sources. Other
related Southeast Asian scripts should be covered there too: Lao, Khmer...
My recent work with UCA and collation, as well as with UTR22 and phonological
analysis of many texts, tends to promote the idea of new normalization forms
in all areas where NFC/NFD or even NFKC/NFKD are failing (we can't change
them because of the stability pact). But UCA, and collation in general, seems
to create a new coded character set, made of ordered collation weights
belonging to separate ranges for each collation level, these ranges being
sorted in the reverse order of the collation level.
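To make the "separate ranges per level" idea concrete, here is a minimal
Python sketch. The range boundaries and the toy collation elements are
assumptions chosen for illustration; they are not DUCET values and not the
actual tables of my implementation. The point it shows is that when deeper
levels use strictly lower ranges, the weights of all levels can be
concatenated into one sequence and compared directly, without level
separators.

```python
# Hypothetical disjoint weight ranges: the deeper the level, the lower the range.
TERTIARY_BASE = 0x0001   # tertiary weights live in 0x0001..0x00FF
SECONDARY_BASE = 0x0100  # secondary weights live in 0x0100..0x0FFF
PRIMARY_BASE = 0x1000    # primary weights live in 0x1000 and above

# Toy collation elements as (primary, secondary, tertiary) offsets.
# These values are invented for the example.
TOY_ELEMENTS = {
    "a": (1, 0, 0),
    "A": (1, 0, 1),        # same letter, tertiary (case) difference
    "\u00e1": (1, 1, 0),   # a with acute: secondary (accent) difference
    "b": (2, 0, 0),
}


def sort_key(text):
    """Concatenate all primary, then secondary, then tertiary weights."""
    elements = [TOY_ELEMENTS[ch] for ch in text]
    primaries = [PRIMARY_BASE + p for p, _, _ in elements]
    secondaries = [SECONDARY_BASE + s for _, s, _ in elements]
    tertiaries = [TERTIARY_BASE + t for _, _, t in elements]
    # No separator is needed: any primary weight is greater than any
    # secondary weight, which is greater than any tertiary weight.
    return tuple(primaries + secondaries + tertiaries)


words = ["b", "Ab", "\u00e1", "ab", "A", "a"]
ordered = sorted(words, key=sort_key)
```

With these toy values, a shorter string sorts before any longer string that
shares its primary weights, because the first secondary weight of the shorter
key is compared against a (larger) primary weight of the longer one.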
I've been experimenting with a collation algorithm that implements UCA using
the same system as the UCD decompositions, but with added (and sometimes
modified) decompositions. This system creates new "code points" needed to
represent only <font> compatibility differences, ligatures, or alternate
forms: an existing compatibility character is decomposed into more basic
characters, which carry the primary differences in UCA, plus these new
characters, which are given "variable" collation weights and may be ignored
by applications that ignore the extra levels. This encoding uses a 31-bit
code space, which is still highly compressible and can still be represented
with the UTF-8 TES (although the values are not Unicode code points) or similar
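As a rough illustration of the two ideas in this paragraph, the Python sketch
below splits a compatibility character into its base characters plus a marker
value placed above U+10FFFF, and then serializes such values with the
original 31-bit UTF-8 bit layout (as in RFC 2279). The marker values, the
VARIANT_MARKERS table and both function names are hypothetical, invented for
this example; only unicodedata.decomposition() is a real library call.

```python
import unicodedata

# Assumption: marker values sit just above the Unicode range so they can
# never collide with real code points.
VARIANT_BASE = 0x110000
VARIANT_MARKERS = {"<font>": VARIANT_BASE + 1, "<compat>": VARIANT_BASE + 2}


def decompose(ch):
    """Return the base code points followed by a marker for the tag."""
    fields = unicodedata.decomposition(ch).split()
    if not fields or not fields[0].startswith("<"):
        # No compatibility tag (canonical mapping or none): keep unchanged.
        return [ord(ch)]
    tag = fields[0]
    bases = [int(f, 16) for f in fields[1:]]
    # Tags without a dedicated entry fall back to a generic marker.
    return bases + [VARIANT_MARKERS.get(tag, VARIANT_BASE)]


def encode_utf8_31bit(value):
    """Encode one integer below 2**31 with the original UTF-8 bit layout."""
    if not 0 <= value < 1 << 31:
        raise ValueError("value outside the 31-bit code space")
    if value < 0x80:
        return bytes([value])
    # (exclusive upper limit, continuation byte count, lead byte prefix)
    layouts = ((0x800, 1, 0xC0), (0x10000, 2, 0xE0), (0x200000, 3, 0xF0),
               (0x4000000, 4, 0xF8), (0x80000000, 5, 0xFC))
    for limit, trail, lead in layouts:
        if value < limit:
            break
    out = [lead | (value >> (6 * trail))]
    for shift in range(6 * (trail - 1), -1, -6):
        out.append(0x80 | ((value >> shift) & 0x3F))
    return bytes(out)


# U+2102 DOUBLE-STRUCK CAPITAL C has the UCD mapping "<font> 0043".
values = decompose("\u2102")
encoded = b"".join(encode_utf8_31bit(v) for v in values)
```

The resulting byte string looks like UTF-8, but a strict UTF-8 decoder will
reject it because the marker lies beyond U+10FFFF, which is exactly why these
values must not be treated as Unicode code points.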
I am currently trying to adapt this system to work together with UTR22
transcodings, and I am testing it against Brahmic scripts, Hebrew, and
Latin. This is very promising, and my next step will be to handle the
decomposition of Han characters into their component radicals and strokes. I
do think that it is possible to handle almost all UCA and UTR22 rules by
using UTR22 itself and decomposition rules in a simple table that nearly
matches the format of the UCD.
But all these discussions and encoding ambiguities of Brahmic scripts are
disrupting my work. I am quite close to dropping my current work on them
until some agreement is found, notably within a revision of ISCII, if one is
in preparation, that will give more precise rules. For now it is impossible
for me to adapt my model to the proposed (and sometimes contradictory)
encoding solutions put forward by distinct
This archive was generated by hypermail 2.1.5 : Sat Nov 29 2003 - 21:02:09 EST