UTS#10 (UCA) 7.1.3 Implicit Weights, Unassigned and Other Code Points

From: Philippe Verdy (verdy_p@wanadoo.fr)
Date: Sun Aug 01 2010 - 13:54:49 CDT


    Implicit weights for unassigned code points and other characters that
    are NOT ill-formed are suboptimal, as noted in the proposed update.

    They should take into account the existing default properties of these code points, notably:

    > 1. Code units exceeding the valid range for code points with scalar values (such as 0x110000 or 0xFFFFFFFF, when handling invalid UTF-32), if they are not handled as errors for collation.
    > 2. Code points with valid scalar values that are permanently assigned to non-characters, if they are not handled as errors for collation:
    > > 2.1. surrogates; or:
    > > 2.2. other scalar values (such as U+FFFF).
    > 3. Their presence in a block or plane assigned to Sinographs ("Unified Ideographs"), either:
    > > 3.1. Unassigned code points that are in allocated "Core" Sinographs blocks (currently, "CJK compatibility" or "CJK unified", all in the BMP), or:
    > > 3.2. Unassigned code points that are in allocated "Other" Sinographs blocks or planes (currently, "CJK Unified Ideographs Extension A" in the BMP, and all reserved code points in the SIP).
    > 4. Unassigned code points that are in blocks or planes assigned to "Special" characters (notably in the Supplementary Special-purpose Plane (SSP), starting at U+E0000).
    > 5. Unassigned code points that are in allocated blocks for non-Sinographs, non-Special, and with default RTL directionality (in the BMP or SMP).
    > 6. Unassigned code points that are in allocated blocks for non-Sinographs, non-Special, and with default LTR directionality (in the BMP or SMP).
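
    Cases 1, 2.1 and 2.2 in the list above depend only on the numeric value, so they can be told apart without any character database. A minimal sketch in Python, assuming a hypothetical function name; cases 3 to 6 need block and assignment data and are deliberately left out:

```python
def value_only_case(nnnn: int):
    """Return "1", "2.1" or "2.2" for the cases decidable from the value alone.

    Returns None for everything else (cases 3 to 6, or assigned characters),
    because those need block and assignment data that this sketch does not have.
    """
    if nnnn < 0 or nnnn > 0xFFFFFFFF:
        raise ValueError("expected an unsigned 32-bit value")
    if nnnn > 0x10FFFF:
        return "1"      # code unit beyond the range of code points
    if 0xD800 <= nnnn <= 0xDFFF:
        return "2.1"    # surrogates
    if 0xFDD0 <= nnnn <= 0xFDEF or (nnnn & 0xFFFE) == 0xFFFE:
        return "2.2"    # noncharacters: U+FDD0..U+FDEF and every U+xxFFFE / U+xxFFFF
    return None
```

    Whether such values are rejected as errors or passed on to collation is a separate decision for the implementation; the sketch only labels them.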

    Currently, if the Unicode scalar value (or invalid code unit) is NNNN
    (an unsigned 32-bit value), it is treated as an expansion to
    ignorable collation elements:
      [.0000.0000.0000.NNNN]

    This means that such a value is always ignored, except at the final
    implicit level, which compares scalar values in binary order. However,
    this is not reasonable for many of them.

    Note that the weight for the last implicit binary level is included in
    the DUCET, but it exceeds the 16-bit capacity of weights, so this level
    is probably split into several successive collation elements, using a
    mechanism similar to surrogates (except that surrogates don't have the
    correct binary order). As this has to allow for code units that exceed
    the range of valid scalar values, and so accept any unsigned 32-bit
    code unit, it could simply be:
    > if (NNNN in 0x0000..0xFFFF), then only one collation element is needed: [.0000.0000.0000.NNNN]; otherwise
    > if (NNNN > 0xFFFF), use three collation elements: [.0000.0000.0000.FFFF][.0000.0000.0000.HHHH][.0000.0000.0000.LLLL], where HHHH=(NNNN>>16) and LLLL=(NNNN&0xFFFF).
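
    The two rules above can be sketched in Python as follows. This is only an illustration of the split described here, not what any existing implementation does; the function names are hypothetical, and the first rule is taken to win for NNNN = 0xFFFF.

```python
def fourth_level_elements(nnnn: int):
    """Return fully ignorable collation elements carrying NNNN on the 4th level.

    Each element is a (primary, secondary, tertiary, fourth) tuple of
    16-bit weights; the first three levels are always zero.
    """
    if nnnn < 0 or nnnn > 0xFFFFFFFF:
        raise ValueError("expected an unsigned 32-bit value")
    if nnnn <= 0xFFFF:
        # Fits in one 16-bit weight: a single collation element.
        return [(0, 0, 0, nnnn)]
    # Otherwise: the FFFF marker, then the high and low 16-bit halves.
    hhhh = nnnn >> 16
    llll = nnnn & 0xFFFF
    return [(0, 0, 0, 0xFFFF), (0, 0, 0, hhhh), (0, 0, 0, llll)]


def format_elements(elements) -> str:
    """Render elements in the [.pppp.ssss.tttt.qqqq] notation used above."""
    return "".join("[.%04X.%04X.%04X.%04X]" % ce for ce in elements)
```

    For example, format_elements(fourth_level_elements(0x110000)) gives [.0000.0000.0000.FFFF][.0000.0000.0000.0011][.0000.0000.0000.0000].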

    Note that for this fourth (last implicit) collation level, run-length
    compression does not apply, since this level is also present for all
    validly encoded characters, contractions and expansions, and it will
    also be used for all the cases above (which are treated as ignorable on
    the first three levels).

    If we want to be smarter, we should not treat ALL the cases above as
    fully ignorable at the first three levels; some of them should get
    primary weights, notably:

    > 3.1. Unassigned code points that are in allocated "Core" Sinographs blocks (currently, "CJK compatibility" or "CJK unified", all in the BMP).
    > > When they are allocated, they will sort using the implicit weights; in my opinion, they should already use that mechanism now, as given by:
    > > [.AAAA.0020.0002.][.BBBB.0000.0000.]
    > > where AAAA=0xFB40+(NNNN>>15) and BBBB=0x8000+(NNNN&0x7FFF);
    > > There's no reason to keep their collation elements unstable across Unicode versions when we can already predict what those collation elements will be.

    > 3.2. Unassigned code points that are in allocated "Other" Sinographs blocks or planes (currently, "CJK Unified Ideographs Extension A" in the BMP, and all reserved code points in the SIP).
    > > When they are allocated, they will sort using the implicit weights; in my opinion, they should already use that mechanism now, as given by:
    > > [.AAAA.0020.0002.][.BBBB.0000.0000.]
    > > where AAAA=0xFB80+(NNNN>>15) and BBBB=0x8000+(NNNN&0x7FFF);
    > > There's no reason to keep their collation elements unstable across Unicode versions when we can already predict what those collation elements will be.
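
    The AAAA/BBBB formulas given for cases 3.1 and 3.2 can be written out as a short Python sketch. The constant and function names are hypothetical, and the sample values in the note below are picked only to exercise the arithmetic, without any claim about their assignment status:

```python
CORE_BASE = 0xFB40   # base used above for "Core" Sinograph blocks (case 3.1)
OTHER_BASE = 0xFB80  # base used above for "Other" Sinograph blocks (case 3.2)


def implicit_elements(nnnn: int, base: int):
    """Return the two (primary, secondary, tertiary) elements for code point NNNN."""
    if nnnn < 0 or nnnn > 0x10FFFF:
        raise ValueError("expected a code point in 0..0x10FFFF")
    aaaa = base + (nnnn >> 15)        # upper bits select the first primary weight
    bbbb = 0x8000 + (nnnn & 0x7FFF)   # lower 15 bits, with the top bit forced on
    return [(aaaa, 0x0020, 0x0002), (bbbb, 0x0000, 0x0000)]


def format_elements(elements) -> str:
    """Render elements in the [.AAAA.0020.0002.] notation used above."""
    return "".join("[.%04X.%04X.%04X.]" % ce for ce in elements)
```

    With NNNN = 0x9FFF and CORE_BASE this gives [.FB41.0020.0002.][.9FFF.0000.0000.]; with NNNN = 0x2A6E0 and OTHER_BASE it gives [.FB85.0020.0002.][.A6E0.0000.0000.]. The result depends only on the code point and the base, which is why it can be predicted before the code point is allocated.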

    > 5. Unassigned code points that are in allocated blocks for non-Sinographs, non-Special, and with default RTL directionality (in the BMP or SMP).
    > 6. Unassigned code points that are in allocated blocks for non-Sinographs, non-Special, and with default LTR directionality (in the BMP or SMP).
    >> When they are allocated, most of them will NOT be fully ignorable, so it is probably best to give them implicit primary weights that are lower than those of the characters already assigned in the same block, but still higher than those of encoded characters from other blocks that sort before the assigned characters of the block. Gaps should be provided in the DUCET at the beginning of the ranges for these blocks so that they can all fit in them. A further benefit is that the blocks sorted after them keep their collation elements stable and are not affected by new allocations in one block.

    The other categories above (for code units exceeding the range of
    valid scalar values if they are not treated as errors, or for code
    points with valid scalar values and assigned to non-characters if they
    are not treated as errors, or for code points with valid scalar values
    assigned or reserved in the special supplementary plane) can be kept
    as fully ignorable, using null weights on the (fully ignorable) first
    three levels, and the implicit (last level) weights for scalar value
    or code unit binary weights.

    Note that valid PUAs are not concerned here: they are not in the
    DUCET, even if they are subject to possible private tailorings that
    make them fully ignorable or give them any other weights (including
    contractions or expansions). Without such a known private convention,
    they should still be treated as fully ignorable (using the implicit
    weights for the last level, sorting by scalar value). But the UCA
    algorithm completely forgets to mention them, so it treats them
    with BASE=0xFBC0, giving them non-zero primary weights and making them
    sort after all Sinographs and before 'Trailing weights'...
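
    To make that last remark concrete, here is a small Python sketch comparing the leading primary weight BASE+(NNNN>>15) for a PUA code point under BASE=0xFBC0 with the ones Sinographs get under the 0xFB40 and 0xFB80 bases. The helper name is hypothetical, and U+E000, U+4E00 and U+20000 are chosen only as examples:

```python
def leading_primary(nnnn: int, base: int) -> int:
    """First primary weight of the implicit collation element pair."""
    return base + (nnnn >> 15)


pua = leading_primary(0xE000, 0xFBC0)     # a BMP private-use code point
core = leading_primary(0x4E00, 0xFB40)    # a "Core" Sinograph
other = leading_primary(0x20000, 0xFB80)  # an "Other" Sinograph in the SIP

# The PUA code point gets a non-zero primary above both Sinograph ranges.
print("%04X < %04X < %04X: %s" % (core, other, pua, core < other < pua))
```

    Treating PUAs as fully ignorable instead would give them a zero primary weight, so they would only be distinguished at the last level.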

    Philippe.


