RE: Ambiguous compositions

From: Kenneth Whistler (kenw@sybase.com)
Date: Thu Dec 21 2000 - 21:14:10 EST


Mike Lischke asked:

> Btw: point 4 in "6 Composition Exclusion Table" (which should really only
> be "Composition exclusions"; otherwise one could assume everything in this
> chapter is related to the exclusion table, which prevented me from
> understanding the singleton issue) requires the combining class (to check
> for a starter), which might not have been read from the database when the
> actual code point is parsed.

For the composition table for normalization (needed to implement
Forms C and KC), you really should be thinking in terms of a
precompiled table -- not something you parse on the fly out of
the UnicodeData.txt file at runtime.

The (re)composition behavior is fixed by the Unicode 3.0.0 data table,
and will remain stable. The guarantees on maintaining normalization
stability are related to that. It is the *de*compositions that will get
extended as more characters are added to the standard. It is at that
end that you need to be able to update easily against the latest
UnicodeData.txt file when you are upgrading to a new version of the
standard.

And as a general principle for dynamic parsing of UnicodeData.txt,
I try never to depend on partial state that varies with where I
am in the data file. I parse the whole file, pulling out all the fields
that I am concerned with and storing them in my data structures. Only
*then* do I do the recursive decompositions and any other processing
that I might need, since the recursive decompositions, in principle,
could refer to any character from anywhere in the data file.
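
To make the two-pass idea concrete, here is a minimal Python sketch. The
sample lines follow the UnicodeData.txt field layout (field 0 = code point,
field 3 = combining class, field 5 = decomposition mapping) but are
deliberately listed out of file order to simulate a forward reference. The
function names are my own, not from any library, and canonical reordering
of the result is left out.

```python
# Sketch only: a few lines in UnicodeData.txt layout, shuffled on purpose
# so that 01FA refers to 00C5 before 00C5 has been seen.
SAMPLE = """\
01FA;LATIN CAPITAL LETTER A WITH RING ABOVE AND ACUTE;Lu;0;L;00C5 0301;;;;N;;;;01FB;
00C5;LATIN CAPITAL LETTER A WITH RING ABOVE;Lu;0;L;0041 030A;;;;N;LATIN CAPITAL LETTER A RING;;;00E5;
0041;LATIN CAPITAL LETTER A;Lu;0;L;;;;;N;;;;0061;
0301;COMBINING ACUTE ACCENT;Mn;230;NSM;;;;;N;NON-SPACING ACUTE;;;;
030A;COMBINING RING ABOVE;Mn;230;NSM;;;;;N;NON-SPACING RING ABOVE;;;;
"""

def parse(text):
    """Pass 1: read every line, keeping only the fields of interest."""
    ccc, decomp = {}, {}
    for line in text.splitlines():
        if not line.strip():
            continue
        fields = line.split(";")
        cp = int(fields[0], 16)
        ccc[cp] = int(fields[3])
        mapping = fields[5]
        # A leading "<tag>" marks a compatibility mapping; keep canonical only.
        if mapping and not mapping.startswith("<"):
            decomp[cp] = [int(x, 16) for x in mapping.split()]
    return ccc, decomp

def full_decompose(cp, decomp):
    """Pass 2: expand recursively, now that every mapping is loaded."""
    if cp not in decomp:
        return [cp]
    out = []
    for part in decomp[cp]:
        out.extend(full_decompose(part, decomp))
    return out

ccc, decomp = parse(SAMPLE)
full = {cp: full_decompose(cp, decomp) for cp in decomp}
```

Because nothing is expanded until the whole sample has been read, the
order of the lines does not matter.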

> The only way I can think of to handle this is to insert every
> non-singleton decomposition in a list and remove later all those which
> are non-starters. Any better ideas?

The non-starter decompositions are listed explicitly in
CompositionExclusions.txt. They are commented out there, because it is
possible to construct this information from UnicodeData.3.0.0.txt. However,
there is no particular reason to do so dynamically in an implementation.
Just build the four values (or rather the lack thereof):

# 0344 COMBINING GREEK DIALYTIKA TONOS
# 0F73 TIBETAN VOWEL SIGN II
# 0F75 TIBETAN VOWEL SIGN UU
# 0F81 TIBETAN VOWEL SIGN REVERSED II

into your precompiled table that you use for (re)composition.
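
As a rough illustration, a table builder along these lines might look like
the Python sketch below. The names and the tiny sample data are
assumptions for the example only. A real build would also leave out the
other exclusions listed in CompositionExclusions.txt, which this sketch
ignores.

```python
# The four non-starter decompositions, hard-coded rather than derived.
NON_STARTER_DECOMPOSITIONS = frozenset({0x0344, 0x0F73, 0x0F75, 0x0F81})

# Hypothetical sample input: canonical decompositions and combining classes
# as they would come out of a full parse of UnicodeData.txt.
DECOMP = {
    0x00C5: [0x0041, 0x030A],  # A WITH RING ABOVE: ordinary pair
    0x212B: [0x00C5],          # ANGSTROM SIGN: singleton
    0x0344: [0x0308, 0x0301],  # COMBINING GREEK DIALYTIKA TONOS
    0x0F73: [0x0F71, 0x0F72],  # TIBETAN VOWEL SIGN II
}
CCC = {
    0x0041: 0, 0x00C5: 0, 0x212B: 0, 0x030A: 230,
    0x0301: 230, 0x0308: 230, 0x0344: 230,
    0x0F71: 129, 0x0F72: 130, 0x0F73: 0,
}

def build_composition_table(decomp, excluded):
    """Map (first, second) -> composite, skipping singletons and exclusions."""
    table = {}
    for cp, parts in decomp.items():
        if len(parts) != 2:   # singletons never recompose
            continue
        if cp in excluded:    # leave the excluded composites out
            continue
        table[(parts[0], parts[1])] = cp
    return table

def is_non_starter_decomposition(cp, parts, ccc):
    """Derivation check: the character or its first part is a non-starter."""
    return ccc[cp] != 0 or ccc[parts[0]] != 0

table = build_composition_table(DECOMP, NON_STARTER_DECOMPOSITIONS)
derived = {cp for cp, parts in DECOMP.items()
           if len(parts) == 2 and is_non_starter_decomposition(cp, parts, CCC)}
```

The `derived` set is there only to show that the hard-coded values agree
with what the combining classes imply for this sample; the table itself
does not need it.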

--Ken


