UTS #10 (UCA): Derived primary weight ranges

From: Philippe Verdy <verdy_p_at_wanadoo.fr>
Date: Sun, 11 Sep 2011 10:06:27 +0200

I think that the UCA forgets to specify which primary weights are valid
when inferred from the default rules used in the current DUCET.

# Derived weight ranges: FB40..FBFF
# [Hani] core primaries: FB40..FB41 (2)
      U+4E00..U+9FFF FB40..FB41 (2)
      U+F900..U+FAFF FB41 (1)
# [Hani] extended primaries: FB80..FB9D (30)
      U+3400..U+4DBF FB80 (1)
      U+20000..U+EFFFF FB84..FB9D (26)
# Other primaries: FBC0..FBE1 (34)
      U+0000..U+EFFFF FBC0..FBDD (30)
      U+F0000..U+10FFFF FBDE..FBE1 (4)
# Trailing weights: FC00..FFFF (1024)

This clearly shows that the currently assigned ranges of primary weights
are far larger than needed.
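
The first-weight ranges in the table above can be recomputed with a short
sketch. It assumes the implicit-weight formula of UTS #10, AAAA = base +
(CP >> 15) and BBBB = (CP & 0x7FFF) | 0x8000, with the bases FB40, FB80 and
FBC0; the function names are only illustrative and are not part of any
specification.

```python
# Assumption: derived (implicit) primaries follow the UTS #10 formula
#   AAAA = base + (CP >> 15),  BBBB = (CP & 0x7FFF) | 0x8000
# with bases FB40 (core Han), FB80 (extended Han), FBC0 (everything else).

def implicit_primary(cp, base):
    """Return the pair (AAAA, BBBB) of derived primary weights for cp."""
    return base + (cp >> 15), (cp & 0x7FFF) | 0x8000

def first_weight_range(lo, hi, base):
    """Return (first AAAA, last AAAA, count) for the code points lo..hi."""
    first = implicit_primary(lo, base)[0]
    last = implicit_primary(hi, base)[0]
    return first, last, last - first + 1

# (first code point, last code point, base) for each row of the table.
ROWS = [
    (0x4E00, 0x9FFF, 0xFB40),
    (0xF900, 0xFAFF, 0xFB40),
    (0x3400, 0x4DBF, 0xFB80),
    (0x20000, 0xEFFFF, 0xFB80),
    (0x0000, 0xEFFFF, 0xFBC0),
    (0xF0000, 0x10FFFF, 0xFBC0),
]

def describe(rows):
    """Format each row as 'U+lo..U+hi first..last (count)'."""
    lines = []
    for lo, hi, base in rows:
        first, last, count = first_weight_range(lo, hi, base)
        lines.append(
            f"U+{lo:04X}..U+{hi:04X} {first:04X}..{last:04X} ({count})"
        )
    return lines

if __name__ == "__main__":
    print("\n".join(describe(ROWS)))
```

Each printed count is simply the number of distinct AAAA values reached by
the row's code point range, so it can be compared directly with the figures
in parentheses.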

- Sinograms can fully be assigned a first primary weight within a set of
only 32 values, instead of the 128 assigned.

- This leaves enough room to separate the primary weights used by PUA
blocks (both in the BMP and in planes 15 and 16), which requires just 1
primary weight for the PUA in the BMP and 4 primary weights for the last
two planes (if other PUA ranges are assigned in the future, for example for
RTL PUAs, we could imagine that this count of 5 weights would be extended
to

- All other primaries will only ever be assigned within planes 0 to 14, and
only to unassigned code points (whose primary weight value should probably
lie between the first derived primary weights for sinograms and those for
the PUA), so they will never need more than 30 primary weights.

Couldn't we remap these default bases for derived primary weights like this,
and keep more space for the rest:

# Derived weight ranges: FBB0..FBFF (80)
# [Hani] core primaries: FBB0..FBB1 (2)
      U+4E00..U+9FFF FBB0 (1)
        (using base=U+2000 for the 2nd primary weight)
      U+F900..U+FAFF FBB1 (1)
        (using base=U+A000 for the 2nd primary weight)
# [Hani] extended primaries: FBB2..FBCF (30)
      U+3400..U+4DBF FBB2 (1)
        (using base=U+2000 for the 2nd primary weight)
      reserved FBB3 (1)
      U+20000..U+EFFFF FBB4..FBCF (26)
        (using base=U+n0000 or U+n8000 for the 2nd primary weight)
# Other non-PUA primaries: FBD0..FBEF (32)
      U+0000..U+EFFFF FBD0..FBED (30)
        (using base=U+n0000 or U+n8000 for the 2nd primary weight)
      reserved FBEE..FBEF (2)
# PUA primaries: FBF0..FBFF (16)
      U+E000..U+F8FF FBF0 (1)
        (using base=U+n8000 for the 2nd primary weight)
      reserved FBF1..FBFB (11)
      U+F0000..U+10FFFF FBFC..FBFF (4)
        (using base=U+n0000 or U+n8000 for the 2nd primary weight)
# Trailing weights: FC00..FFFF (1024)

This scheme completely frees the range FB40..FBAF, while shrinking the gaps
that are currently left and will never be used.
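
As a quick consistency check, the top-level blocks of the proposed layout
can be verified to tile FBB0..FBFF without gaps or overlaps. This is only a
sketch: the block list is transcribed from the section headers of the table
above, taking the extended [Hani] block to end at FBCF (as its count of 30
and the following FBD0 imply), and `check_tiling` is a hypothetical helper
name.

```python
# Top-level blocks of the proposed layout as (label, first, last).
# Assumption: the extended [Hani] block ends at FBCF, so that it holds
# 30 values and the next block can start at FBD0.
PROPOSED = [
    ("[Hani] core", 0xFBB0, 0xFBB1),
    ("[Hani] extended", 0xFBB2, 0xFBCF),
    ("other non-PUA", 0xFBD0, 0xFBEF),
    ("PUA", 0xFBF0, 0xFBFF),
]

def check_tiling(blocks, start, end):
    """Return the total number of weights if blocks tile start..end exactly.

    Raises ValueError on a gap, an overlap, or an empty block.
    """
    expected = start
    total = 0
    for label, first, last in blocks:
        if first != expected:
            raise ValueError(f"{label}: starts at {first:04X}, "
                             f"expected {expected:04X}")
        if last < first:
            raise ValueError(f"{label}: empty or reversed range")
        total += last - first + 1
        expected = last + 1
    if expected != end + 1:
        raise ValueError(f"layout ends at {expected - 1:04X}, "
                         f"expected {end:04X}")
    return total

# Number of weights used by the proposed derived ranges.
used = check_tiling(PROPOSED, 0xFBB0, 0xFBFF)

# Number of weights freed below the new start, i.e. FB40..FBAF.
freed = 0xFBB0 - 0xFB40

if __name__ == "__main__":
    print(f"used: {used}, freed: {freed}")
```

The helper only looks at block boundaries; it says nothing about whether
the second primary weights inside each block are assigned sensibly.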

(In this scheme, I have no opinion on which range is best for code
points assigned to non-characters, but they could all map to FBFF, used here
for the PUA, with the second primary weight at the end of the encoding space
moved from 8000..FFFF to 4000..BFFF, so that the second primary weight for
non-characters fits easily into C000..FFFF.)

This way, we would keep ranges available for future large non-sinographic
scripts (pictographic, non-Han ideographic), that would probably use only
derived weights, or for a refined DUCET containing more precise levels or
gaps facilitating some derived collation tables (for example in CLDR).

And all PUAs would clearly sort within dedicated ranges of primary weights,
with a guarantee that they all sort at the end, after all scripts.

-- Philippe.
Received on Sun Sep 11 2011 - 03:10:38 CDT
