Re: UTS#10 (UCA) 7.1.3 Implicit Weights, Unassigned and Other Code Points

From: Kenneth Whistler (kenw@sybase.com)
Date: Mon Aug 02 2010 - 18:43:58 CDT


    Philippe Verdy said:

    > Implicit weights for unassigned code points and other characters that
    > are NOT ill-formed are suboptimal, as noted in the proposed update.

    To follow up on Mark's response on this thread...

    >
    > It should take into account their existing default properties, notably :

    [ long list snipped: includes surrogates, noncharacters, and
    unassigned characters in various ranges ]

    > Currently, if the Unicode scalar value (or invalid code unit) is NNNN
    > (unsigned 32-bit value), then they are treated as expansions to
    > ignorable collation elements:
    > [.0000.0000.0000.NNNN]

    That statement is incorrect. The UCA currently specifies that
    ill-formed code unit sequences and *noncharacters* are mapped
    to [.0000.0000.0000.], but unassigned code points are not.
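
A minimal sketch of the distinction drawn above, assuming Python's standard `unicodedata` module for the General_Category lookup. The function names and the returned labels are hypothetical, chosen for illustration only; ill-formed code unit sequences are not modeled, and the "ignorable" branch reflects the specification as described in this message, not the proposed update.

```python
import unicodedata

def is_noncharacter(cp: int) -> bool:
    """True for U+FDD0..U+FDEF and the last two code points of each plane."""
    return 0xFDD0 <= cp <= 0xFDEF or (cp & 0xFFFE) == 0xFFFE

def default_treatment(cp: int) -> str:
    """Hypothetical labels for the two cases discussed in this paragraph."""
    if is_noncharacter(cp):
        # Mapped to [.0000.0000.0000.] per the specification as described here.
        return "ignorable"
    if unicodedata.category(chr(cp)) == "Cn":
        # Unassigned (reserved) code point: gets implicit weights by rule.
        return "implicit"
    # Everything else is outside the scope of this sketch.
    return "other"
```

Note that noncharacters also have General_Category Cn, which is why they are tested first.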

    > If we want to be smarter, we should not treat ALL the cases above as
    > fully ignorable at the first three levels, and should get primary
    > weights notably:

    Hmmm, if we want to be smarter, we should read what the actual
    specification says.

    > > 5. Unassigned code points that are in allocated blocks for
    > > non-Sinographs, non-Special, and with default RTL directionality
    > > (in the BMP or SMP).
    > > 6. Unassigned code points that are in allocated blocks for
    > > non-Sinographs, non-Special, and with default RTL directionality
    > > (in the BMP or SMP).
    > >> When they will be allocated, most of them will NOT be fully ignorable,
    > >> and it's probably best to give them appropriate implicit primary weights

    They already are, but...

    > so that they sort with primary weights lower than those used for
    > characters in the same block, but still higher than those of encoded
    > characters from other blocks that have lower primary weights than
    > assigned characters in the block. Gaps should be provided in the DUCET
    > at the beginning of ranges for these blocks so that they can all fit
    > in them. The benefit being also that other blocks after them will
    > keep their collation elements stable and won't be affected by the
    > new allocations in one block.

    That particular way of assigning implicit weights for unassigned
    characters would be a complete mess to implement for the default
    table.

    A. It would substantially increase the size of the default table
    for *all* users, because it would assign primary weights for
    all unassigned code points inside blocks -- code points which
    now simply get implicit weights assigned by rule.

    B. The assumptions about better default behavior are erroneous,
    because they presuppose things which are not necessarily true. In
    particular, the main goal appears to be to assure well-behavedness
    for future additions on a per-script basis, since primary weight
    orders are relevant to scripts. However, several of the most important
    scripts are now, for historical reasons, encoded in multiple
    blocks. A rule which assigns default primary weights on a per
    block basis for unassigned characters would serve no valid purpose
    in such cases.

    C. In addition to randomizing primary weight assignments for
    scripts in the case of multiple-block scripts, such a rule would
    also introduce *more* unpredictability in cases of the punctuation
    and symbols which are scattered around among many, many blocks,
    as well.

    In general this proposal fails to understand that the default
    weights for DUCET (as expressed in allkeys.txt) have absolutely
    nothing whatsoever to do with block identities or block
    ranges in the standard. The weighting algorithm knows absolutely
    nothing about block values.

    > The other categories above (for code units exceeding the range of
    > valid scalar values if they are not treated as errors, or for code
    > points with valid scalar values and assigned to non-characters if they
    > are not treated as errors, or for code points with valid scalar values
    > assigned or reserved in the special supplementary plane) can be kept
    > as fully ignorable, using null weights on the (fully ignorable) first
    > three levels, and the implicit (last level) weights for scalar value
    > or code unit binary weights.

    Except that such treatment is not optimal for the noncharacters.
    As noted in the review note in the proposed update for UTS #10,
    noncharacters should probably be given implicit weights, rather
    than being treated as ignorables by default. That is a proposed
    change to the specification.

    > Note that valid PUAs are not concerned here: they are not in the
    > DUCET, even if they are subject to possible private tailorings to make
    > them fully ignorable or use any other weights (including with
    > contractions or expansions). Without such known private convention,
    > they should still be treated as fully ignorable (using the implicit
    > weights for the last level sorting by scalar values).

    No, they should not.

    > But the UCA
    > algorithm completely forgets to speak about them, so it treats them
    > with BASE=0xFBC0, giving non-zero primary weights and making them sort
    > after all Sinographs and before 'Trailing weights'...

    Correct. But that is by design -- not because the algorithm completely
    forgets to speak about them.

    Although I agree that it would be a good idea to call the PUA
    out explicitly as subject to the implicit weighting, so that
    people are not unclear about this.
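
For readers who want to see what implicit weighting does to a PUA code point, here is a rough sketch of the derivation in UTS #10 section 7.1.3 as I understand it: the code point is split into two primary weights, AAAA = BASE + (CP >> 15) and BBBB = (CP & 0x7FFF) | 0x8000. Only the BASE value 0xFBC0 mentioned in this thread is used as the default; the function names are hypothetical, and the specification itself should be consulted for the normative text.

```python
def implicit_weights(cp: int, base: int = 0xFBC0) -> tuple[int, int]:
    """Return the two implicit primary weights (AAAA, BBBB) for a code point."""
    aaaa = base + (cp >> 15)        # high bits, offset by BASE
    bbbb = (cp & 0x7FFF) | 0x8000   # low 15 bits, with the top bit set
    return aaaa, bbbb

def format_elements(cp: int, base: int = 0xFBC0) -> str:
    """Render the resulting pair of collation elements in UCA notation."""
    aaaa, bbbb = implicit_weights(cp, base)
    return f"[.{aaaa:04X}.0020.0002.][.{bbbb:04X}.0000.0000.]"

print(format_elements(0xE000))  # [.FBC1.0020.0002.][.E000.0000.0000.]
```

The nonzero first primary weight is what places U+E000 among the implicitly weighted code points rather than among the ignorables.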

    --Ken


