Re: UTS #10 (UCA): Derived primary weight ranges

From: Mark Davis ☕ <mark_at_macchiato.com>
Date: Mon, 12 Sep 2011 15:09:42 -0700

I don't think there is any particular value to that restructuring, from what
I can make of your email.

Note also, with regard to your message about 'real' weights, that there is
no requirement that implementations preserve the DUCET values, as long as
the ordering is the same. In particular, CLDR and many implementations use
the 'fractional' UCA weights, which are derived from the DUCET values, but
express weights using a variable number of bytes. These are similar to your
'rationals' but are really decimal values chunked into bytes, with some extra
features to allow interleaving and avoid overlap.

http://unicode.org/Public/UCA/6.0.0/CollationAuxiliary.html
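
To illustrate the point that only the ordering matters, here is a minimal Python sketch. It is not the CLDR fractional weight algorithm; `to_variable_bytes` and `remap` are hypothetical helpers that replace 16-bit weights with variable-length byte strings (one byte for the lowest ranks, two bytes for the rest) such that no encoding is a prefix of another and byte-wise comparison gives the same order as the original numeric weights.

```python
from typing import Dict, Iterable


def to_variable_bytes(rank: int) -> bytes:
    """Hypothetical encoding: ranks 0..124 take one byte, later ranks take two.

    Single bytes use 0x03..0x7F; two-byte forms use a lead byte in 0x80..0xFF,
    so a short weight can never be a prefix of a long one.
    """
    if rank < 0:
        raise ValueError("rank must be non-negative")
    if rank < 125:
        return bytes([0x03 + rank])
    lead, trail = divmod(rank - 125, 254)
    if lead > 0x7F:
        raise ValueError("too many weights for this toy scheme")
    return bytes([0x80 + lead, 0x02 + trail])  # trail byte in 0x02..0xFF


def remap(weights: Iterable[int]) -> Dict[int, bytes]:
    """Map each distinct weight to a byte string that sorts the same way."""
    ordered = sorted(set(weights))
    return {w: to_variable_bytes(i) for i, w in enumerate(ordered)}


def sort_key(weight_seq, table) -> bytes:
    """Concatenate the remapped weights of a sequence into one byte key."""
    return b"".join(table[w] for w in weight_seq)
```

Sorting sequences of weights by `sort_key` should give the same result as sorting them by their original numeric values, even though none of the original values survive in the keys.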

Mark
*— Il meglio è l’inimico del bene —*

On Sun, Sep 11, 2011 at 01:06, Philippe Verdy <verdy_p_at_wanadoo.fr> wrote:

> I think that the UCA does not specify which primary weights are valid when
> inferred from the default rules used in the current DUCET.
>
> # Derived weight ranges: FB40..FBFF
> # [Hani] core primaries: FB40..FB41 (2)
> U+4E00..U+9FFF FB40..FB41 (2)
> U+F900..U+FAFF FB41 (1)
> # [Hani] extended primaries: FB80..FB9D (30)
> U+3400..U+4DBF FB80 (1)
> U+20000..U+EFFFF FB84..FB9D (26)
> # Other primaries: FBC0..FBE1 (34)
> U+0000..U+EFFFF FBC0..FBDD (30)
> U+F0000..U+10FFFF FBDE..FBE1 (4)
> # Trailing weights: FC00..FFFF (1024)
>
> This clearly shows that the currently assigned ranges of primary weights
> are far larger than they need to be.
>
> - Sinograms can fully be assigned a first primary weight within a set of
> only 32 values, instead of the 128 assigned.
>
> - This leaves enough room to separate the primary weights used by PUA
> blocks (both in the BMP and in planes 15 and 16), which just requires 1
> primary weight for the PUAs in the BMP, and 4 primary weights for the last
> two planes (if some other future PUA ranges are assigned, for example for
> RTL PUAs, we could imagine that this count of 5 weights would be extended
> to
>
> - All other primaries will never be assigned to anything outside planes 0
> to 14, and only for unassigned code points (whose primary weight value
> should probably be between the first derived primary weights for sinograms,
> and those from the PUA), so they'll never need more than 30 primary weights.
>
> Couldn't we remap these default bases for derived primary weights like
> this, and keep more space for the rest:
>
> # Derived weight ranges: FBB0..FBFF (80)
> # [Hani] core primaries: FBB0..FBB1 (2)
> U+4E00..U+9FFF FBB0 (1)
> (using base=U+2000 for the 2nd primary weight)
> U+F900..U+FAFF FBB1 (1)
> (using base=U+A000 for the 2nd primary weight)
> # [Hani] extended primaries: FBB2..FBCF (30)
> U+3400..U+4DBF FBB2 (1)
> (using base=U+2000 for the 2nd primary weight)
> reserved FBB3 (1)
> U+20000..U+EFFFF FBB4..FBCF (26)
> (using base=U+n0000 or U+n8000 for the 2nd primary weight)
> # Other non-PUA primaries: FBD0..FBEF (32)
> U+0000..U+EFFFF FBD0..FBED (30)
> (using base=U+n0000 or U+n8000 for the 2nd primary weight)
> reserved FBEE..FBEF (2)
> # PUA primaries: FBF0..FBFF (16)
> U+D800..U+DFFF FBF0 (1)
> (using base=U+n8000 for the 2nd primary weight)
> reserved FBF1..FBFB (11)
> U+F0000..U+10FFFF FBFC..FBFF (4)
> (using base=U+n0000 or U+n8000 for the 2nd primary weight)
> # Trailing weights: FC00..FFFF (1024)
>
> This scheme completely frees the range FB40..FBAF, while reducing the gaps
> currently left which will never have any use.
>
> (In this scheme, I have no opinion on which range is best for code points
> assigned to non-characters, but they could all map to FBFF, used here for
> PUA, with the second primary weight for PUA moved from the end of the
> encoding space, 8000..FFFF, to 4000..BFFF, so that the second primary
> weight for non-characters fits easily into C000..FFFF.)
>
> This way, we would keep ranges available for future large non-sinographic
> scripts (pictographic, non-Han ideographic), that would probably use only
> derived weights, or for a refined DUCET containing more precise levels or
> gaps facilitating some derived collation tables (for example in CLDR).
>
> And all PUAs would clearly sort within dedicated ranges of primary weights,
> with a guarantee that they all sort at the end, after all scripts.
>
> -- Philippe.
>
>
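For reference, the current ranges in the first table of the quoted message follow from the implicit weight computation in UCA 6.0: the first primary weight is the base plus the code point shifted right by 15 bits, and the second is the low 15 bits with the top bit set. A minimal Python sketch follows. The function names are my own, and the base (FB40, FB80 or FBC0) is passed in by the caller rather than looked up from the Unified_Ideograph and block properties.

```python
from typing import Tuple

# Bases for derived (implicit) primary weights, as used in the quoted table.
BASE_HAN_CORE = 0xFB40      # CJK Unified Ideographs and Compatibility Ideographs
BASE_HAN_EXTENDED = 0xFB80  # other unified ideographs (Extension A, B, ...)
BASE_OTHER = 0xFBC0         # any other code point without an explicit weight


def implicit_weights(cp: int, base: int) -> Tuple[int, int]:
    """Return the pair of derived primary weights (AAAA, BBBB) for a code point."""
    aaaa = base + (cp >> 15)
    bbbb = (cp & 0x7FFF) | 0x8000
    return aaaa, bbbb


def first_weight_range(first_cp: int, last_cp: int, base: int) -> Tuple[int, int]:
    """Return the lowest and highest first weight used by a code point range."""
    return base + (first_cp >> 15), base + (last_cp >> 15)


def count_first_weights(first_cp: int, last_cp: int) -> int:
    """Number of distinct first weights a code point range can produce."""
    return (last_cp >> 15) - (first_cp >> 15) + 1
```

Because each 32K-code-point half of a plane gets its own first weight, planes 2 to 14 need 26 of them, and planes 15 and 16 need 4.
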
Received on Mon Sep 12 2011 - 17:13:41 CDT
