Re: UTS #10 (UCA): Derived primary weight ranges

From: Philippe Verdy <verdy_p_at_wanadoo.fr>
Date: Tue, 13 Sep 2011 06:02:50 +0200

Not really, because the derived primary weights are already present in the
DUCET, within the expansions of CJK compatibility characters. We need to take
into account how the DUCET values are computed, in order to determine where,
among sequences containing CJK ideographs, the compatibility characters will
fit in an ordered sequence of strings, including at the primary level of
collation.

Yes, renumbering is not an easy task if we use the current DUCET format,
which does not offer an easy view of how it is really structured. That's why
I really don't like the fact that it fixes arbitrary weight values without
really exposing their relative properties. I much prefer the compressed
syntax used in LDML, like "&a<<<A<<à<<<À<b<<<B...", which does not fix any
arbitrary weights, because that is clearly not needed: we get much more
freedom in how weights are generated, and the syntax is much more
expressive, without needlessly fixing those arbitrary weight values.
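
As a minimal sketch (my own Python illustration, not CLDR or ICU code, and
ignoring '=' identities and quaternary '<<<<' differences that real LDML
rules also allow), such a rule string can be read back as a purely relative
structure: an ordered list of elements, each tagged with the level at which
it differs from its predecessor, with no absolute weight anywhere:

  import re

  def parse_ldml_rules(rules):
      # '&a<<<A<<à<<<À<b<<<B' parses to [('a', None), ('A', 3), ('à', 2),
      # ('À', 3), ('b', 1), ('B', 3)]: None marks the reset point; 1, 2
      # and 3 mean primary, secondary and tertiary differences.
      tokens = re.findall(r'&|<{1,3}|[^&<]+', rules)
      out, level = [], None
      for tok in tokens:
          if tok == '&':
              level = None                 # next element is a reset point
          elif set(tok) == {'<'}:
              level = len(tok)             # number of '<' = difference level
          else:
              out.append((tok, level))
      return out

  print(parse_ldml_rules('&a<<<A<<à<<<À<b<<<B'))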

I say this because I am implementing this idea (for now I still have
problems computing the contextual rules, notably those with a context on
the right side, but no problem at all representing the pseudo collation
elements for minimum script primaries). What I am writing is in fact yet
another representation, where I just set:
  level1 = [Latn], a, b, ..., [Grek], ..., [Cyrl], ...
  level2 = a, à, b, ...
  level3 = a, A, à, À, b, B
using a simple and very compact format (not requiring any weight values)
based on simple ordered lists, with absolutely NO complex operators between
them. The commas here just illustrate the ordered-list format, which is
implicit; it can also be abbreviated as:
  level1=[Latn]a-z...[Grek]...[Cyrl]...
  level2=aàb...
  level3=aAàÀbB
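
As a sketch of how comparison can work directly on such lists (my own
Python illustration; the names are hypothetical), the deepest list gives
the total order, and the rank of a character at level N is simply the
position of the nearest preceding element that the level-N list knows:

  DEEP = list("aAàÀbB")                        # level 3: the total order
  MEMBERS = {1: set("ab"), 2: set("aàb"), 3: set("aAàÀbB")}

  def rank(ch, level):
      # Fall back to the element heading this character's group at
      # the requested level.
      i = DEEP.index(ch)
      while DEEP[i] not in MEMBERS[level]:
          i -= 1
      return i

  def compare(s, t):
      # Levels 2 and 3 are consulted only when level 1 ties, so a
      # primary-strength comparison never needs the deeper lists.
      for level in (1, 2, 3):
          a = [rank(c, level) for c in s]
          b = [rank(c, level) for c in t]
          if a != b:
              return -1 if a < b else 1
      return 0

  assert compare("a", "A") < 0 and compare("A", "à") < 0 and compare("à", "b") < 0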

There is no need for any reset: resets are implicit from the presence of a
collation element at a level N+1 which is already positioned in the list for
level N. I can still use delimiters just to represent contractions, and I
don't even need to represent expansions.

If needed, I can add supplementary statistical data about the usage of
primary ranges in a specific language. From this data I deduce rational
weights, and I can also infer an optimal Huffman or arithmetic coding (this
is only needed when generating collation keys; such data is not needed for
merely comparing Unicode strings, because I can always split the half-open
range of rationals [0..1[ into as many partitions as wanted: no fixed bit
precision).
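
A minimal sketch of these rational weights (again my own illustration),
using exact fractions so the half-open range [0..1[ can be re-split
forever:

  from fractions import Fraction

  def assign_weights(elements):
      # Spread n elements evenly over [0, 1): element i gets weight i/n.
      n = len(elements)
      return {e: Fraction(i, n) for i, e in enumerate(elements)}

  def insert_between(lo, hi):
      # A tailored element between two neighbours gets the midpoint;
      # exact rationals never run out of room, so no bit precision is
      # ever committed to.
      return (lo + hi) / 2

  weights = assign_weights(["a", "b", "c"])
  weights["á"] = insert_between(weights["a"], weights["b"])
  print(sorted(weights, key=weights.get))      # ['a', 'á', 'b', 'c']

Only when collation keys must be serialized does a concrete byte encoding
(Huffman or arithmetic, over the observed frequencies) need to be chosen.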

Adding a tailoring just consists of specifying an ordered list of collation
elements (single characters or contractions), each list specifying the
collation level at which its collation elements are differentiated. I also
don't need to load levels 2..N at all when performing only a level-1
collation, and this is enough for representing contextual collation
elements as well.

Things like collation mappings become much easier to perceive and to
specify correctly. This format can even represent all the case mappings and
case foldings in the UCD, as specialized but limited collations. The
compatibility decomposition mappings of the UCD are also representable as a
tailoring, and even the canonical decomposition mappings can be represented
by a "level-infinity" list like those above (though it cannot represent the
canonical combining classes).

I no longer need to specify the gaps for interleavings, because they are
implicitly present at all levels, and almost infinitely tailorable (up to
the max precision of rationals).

The same format can be used to represent the mappings used for transcoding
from/to non-Unicode encodings, once again as a specialized tailoring.

I can also represent the collation level numbers themselves as rationals
(for example, with Hangul, I can set a list for level 1 listing only the
leading consonants, a level 1.3 listing these consonants plus LV syllables
and vowel jamos, and a level 1.6 adding LVT syllables and trailing
consonants: no more need for "trailing weights").
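
A sketch of these fractional levels (the level numbers are the point here;
the short jamo and syllable lists are only illustrative, not a worked-out
Hangul tailoring):

  from fractions import Fraction as F

  LEVEL_LISTS = {
      F(1):       ["ᄀ", "ᄂ"],                              # leading consonants only
      F(13, 10):  ["ᄀ", "가", "거", "ᄂ", "나"],             # + LV syllables
      F(16, 10):  ["ᄀ", "가", "각", "간", "거", "ᄂ", "나"], # + LVT syllables
  }

  # Walking the levels in numeric order slots the fractional lists in
  # between the integer ones, so adding an intermediate level never
  # forces a renumbering.
  for level in sorted(LEVEL_LISTS):
      print(level, LEVEL_LISTS[level])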

Maybe you don't see the benefit for now, simply because it is very
different from what you have designed in ICU. But many of the complicated
cases and exceptions that you need to handle with complex code in ICU would
be simplified a lot.

Finally, my initial desire when posting the comment about derived weights
was about how PUAs are currently ordered (mixed at the primary level with
all other unassigned code points and non-characters), when they should be
treated as a script by themselves, representable using the abbreviated LDML
format like "& [Hani] < [Qqqq]", and easily reorderable in any tailoring if
one does not want them ordered after all sinograms and all other collation
elements (except possibly the set of non-characters and surrogates).

Philippe.

2011/9/13 Mark Davis ☕ <mark_at_macchiato.com>

> I don't think there is any particular value to that restructuring, from
> what I can make of your email.
>
> Note also, with regard to your message about 'real' weights, that there is
> no requirement that implementations preserve the DUCET values, as long as
> the ordering is the same. In particular, CLDR and many implementations use
> the 'fractional' UCA weights, which are derived from the DUCET values, but
> express weights using a variable number of bytes. These are similar to your
> 'rationals' but are really decimal values chunked into bytes, with some extra
> features to allow interleaving and avoid overlap.
>
> http://unicode.org/Public/UCA/6.0.0/CollationAuxiliary.html
>
> Mark
> *— Il meglio è l’inimico del bene — (The best is the enemy of the good)*
>
>
> On Sun, Sep 11, 2011 at 01:06, Philippe Verdy <verdy_p_at_wanadoo.fr> wrote:
>
>> I think that the UCA fails to specify which primary weights are validly
>> inferred from the default rules used in the current DUCET.
>>
>> # Derived weight ranges: FB40..FBFF
>> # [Hani] core primaries: FB40..FB41 (2)
>> U+4E00..U+9FFF FB40..FB41 (2)
>> U+F900..U+FAFF FB41 (1)
>> # [Hani] extended primaries: FB80..FB9D (30)
>> U+3400..U+4DBF FB80 (1)
>> U+20000..U+EFFFF FB84..FB9D (26)
>> # Other primaries: FBC0..FBE1 (34)
>> U+0000..U+EFFFF FBC0..FBDD (30)
>> U+F0000..U+10FFFF FBDE..FBE1 (4)
>> # Trailing weights: FC00..FFFF (1024)
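>>
>> As a sketch of where these ranges come from (my Python paraphrase of the
>> implicit-weight computation in UTS #10, keyed off the raw ranges in this
>> table; the real rule uses the Unified_Ideograph property, not hard-coded
>> ranges):
>>
>>   def implicit_primary(cp):
>>       # Returns the pair of 16-bit primary weights (AAAA, BBBB) that
>>       # UCA 6.0 derives for a code point with no explicit mapping.
>>       if 0x4E00 <= cp <= 0x9FFF or 0xF900 <= cp <= 0xFAFF:
>>           base = 0xFB40                          # [Hani] core
>>       elif 0x3400 <= cp <= 0x4DBF or 0x20000 <= cp <= 0xEFFFF:
>>           base = 0xFB80                          # [Hani] extended
>>       else:
>>           base = 0xFBC0                          # any other code point
>>       return (base + (cp >> 15), (cp & 0x7FFF) | 0x8000)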
>>
>> It clearly shows that the currently assigned ranges of primary weights
>> are far larger than needed.
>>
>> - Sinograms can be fully assigned first primary weights from a set of
>> only 32 values, instead of the 128 currently assigned.
>>
>> - This leaves enough room to separate the primary weights used by PUA
>> blocks (both in the BMP and in planes 15 and 16): just 1 primary weight
>> is needed for the PUA in the BMP, and 4 primary weights for the last two
>> planes (if other PUA ranges are assigned in the future, for example for
>> RTL PUAs, this count of 5 weights could be extended accordingly).
>>
>> - All other primaries will never be assigned to anything outside planes 0
>> to 14, and only to unassigned code points (whose primary weight values
>> should probably fall between the first derived primary weights for
>> sinograms and those for the PUA), so they'll never need more than 30
>> primary weights.
>>
>> Couldn't we remap these default bases for derived primary weights like
>> this, and keep more space for the rest:
>>
>> # Derived weight ranges: FBB0..FBFF (80)
>> # [Hani] core primaries: FBB0..FBB1 (2)
>> U+4E00..U+9FFF FBB0 (1)
>> (using base=U+2000 for the 2nd primary weight)
>> U+F900..U+FAFF FBB1 (1)
>> (using base=U+A000 for the 2nd primary weight)
>> # [Hani] extended primaries: FBB2..FBCF (30)
>> U+3400..U+4DBF FBB2 (1)
>> (using base=U+2000 for the 2nd primary weight)
>> reserved FBB3 (1)
>> U+20000..U+EFFFF FBB4..FBCF (26)
>> (using base=U+n0000 or U+n8000 for the 2nd primary weight)
>> # Other non-PUA primaries: FBD0..FBEF (32)
>> U+0000..U+EFFFF FBD0..FBED (30)
>> (using base=U+n0000 or U+n8000 for the 2nd primary weight)
>> reserved FBEE..FBEF (2)
>> # PUA primaries: FBF0..FBFF (16)
>> U+E000..U+F8FF FBF0 (1)
>> (using base=U+n8000 for the 2nd primary weight)
>> reserved FBF1..FBFB (11)
>> U+F0000..U+10FFFF FBFC..FBFF (4)
>> (using base=U+n0000 or U+n8000 for the 2nd primary weight)
>> # Trailing weights: FC00..FFFF (1024)
>>
>> This scheme completely frees the range FB40..FBAF, while shrinking the
>> gaps currently left that will never have any use.
>>
>> (In this scheme, I have no opinion about which range is best for code
>> points assigned to non-characters, but they could all map to FBFF, used
>> here for the PUA, provided the second primary weights at the end of the
>> encoding space are moved from 8000..FFFF down to 4000..BFFF, so that the
>> second primary weights for non-characters fit easily into C000..FFFF.)
>>
>> This way, we would keep ranges available for future large non-sinographic
>> scripts (pictographic, non-Han ideographic), that would probably use only
>> derived weights, or for a refined DUCET containing more precise levels or
>> gaps facilitating some derived collation tables (for example in CLDR).
>>
>> And all PUAs would clearly sort within dedicated ranges of primary
>> weights, with a guarantee that they all sort at the end, after all scripts.
>>
>> -- Philippe.
>>
>>
>