Re: UTS#10 (UCA) 7.1.3 Implicit Weights, Unassigned and Other Code Points

From: verdy_p (verdy_p@wanadoo.fr)
Date: Wed Aug 04 2010 - 16:50:09 CDT


    "Kenneth Whistler" wrote:
    > > Currently, if the Unicode scalar value (or invalid code unit) is NNNN
    > > (unsigned 32-bit value), then they are treated as expansions to
    > > ignorable collation elements:
    > > [.0000.0000.0000.NNNN]
    >
    > That statement is incorrect. The UCA currently specifies that
    > ill-formed code unit sequences and *noncharacters* are mapped
    > to [.0000.0000.0000.], but unassigned code points are not.

    This is exactly equivalent: if you use strength level 3, they are both [.0000.0000.0000]. If you need semi-stable
    sort keys, you NEED to add the binary representation of the scalar values to your sort keys. And the UCA already
    accepts the fact that ill-formed sequences may still be sorted without an error. The only way to do that with a
    semi-stable sort key is to also include this scalar value as a final level in your sort key, even if the sequence is
    ill-formed.
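
    As a minimal sketch of that idea (Python; the table TOY_WEIGHTS, its made-up weight values and the function name
    semi_stable_key are assumptions for illustration, not part of UTS #10), the first three levels ignore zero weights
    and a final level keeps every input value:

```python
# Toy weight table (assumption, made-up values): code point -> (L1, L2, L3).
# Anything missing from it stands in for a fully ignorable element.
TOY_WEIGHTS = {
    0x61: (0x0100, 0x0020, 0x0002),  # "a"
    0x62: (0x0101, 0x0020, 0x0002),  # "b"
}


def semi_stable_key(units):
    """Build a 4-level key from a sequence of integer scalar values
    (or raw code unit values when the input is ill-formed)."""
    levels = ([], [], [])
    for unit in units:
        weights = TOY_WEIGHTS.get(unit, (0, 0, 0))
        for level, weight in zip(levels, weights):
            if weight:  # zero weights are ignorable and never enter the key
                level.append(weight)
    # Final level: every input unit, so ignorable units still break ties.
    return tuple(tuple(level) for level in levels) + (tuple(units),)
```

    With this sketch, the inputs [0x61, 0xFFFF, 0x62] and [0x61, 0x62] get identical keys at the first three levels and
    differ only at the final level.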

    > > If we want to be smarter, we should not treat ALL the cases above as
    > > fully ignorable at the first three levels, and should get primary
    > > weights notably:
    >
    > Hmmm, if we want to be smarter, we should read what the actual
    > specification says.

    That's what I did. If there's a contradiction for you, that's because the specification is ambiguous on these points.
    I've read and re-read it many times before concluding that this was NOT fully specified (and then permitted under my
    interpretation).

    > > so that they sort with primary weights lower than those used for
    > > characters in the same block, but still higher than those of encoded
    > > characters from other blocks that have lower primary weights than the
    > > assigned characters in the block. Gaps should be provided in the DUCET at
    > > the beginning of ranges for these blocks so that they can all fit
    > > in them. The benefit being also that other blocks after them will
    > > keep their collation elements stable and won't be affected by the
    > > new allocations in one block.
    >
    > That particular way of assigning implicit weights for unassigned
    > characters would be a complete mess to implement for the default
    > table.

    Yes, I admit that it would create huge gaps everywhere, but it's not so critical for sinograms, which are encoded in
    a very compact way, with NO gap at all (given that they are assigned primary weights algorithmically from their
    scalar value). So mapping sinograms with the same scheme, even those that are not yet encoded but at least fall
    within the assigned blocks or planes, will make NO difference in the DUCET.
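
    For reference, the implicit weight computation of UTS #10 7.1.3 can be sketched as follows (Python). The function
    name is hypothetical, and the range tests are a deliberate simplification: they use whole blocks and planes instead
    of the Unified_Ideograph property, which is the variation discussed above and not what the specification says.

```python
def implicit_primary_weights(cp):
    """Return the two implicit primary weights (AAAA, BBBB) for code point cp."""
    # Assumption: whole-block / whole-plane ranges, encoded or not.
    if 0x4E00 <= cp <= 0x9FFF or 0xF900 <= cp <= 0xFAFF:
        base = 0xFB40  # CJK Unified Ideographs, CJK Compatibility Ideographs
    elif 0x3400 <= cp <= 0x4DBF or 0x20000 <= cp <= 0x2FFFF:
        base = 0xFB80  # Extension A and the supplementary ideographic plane
    else:
        base = 0xFBC0  # any other code point
    aaaa = base + (cp >> 15)
    bbbb = (cp & 0x7FFF) | 0x8000
    return aaaa, bbbb
```

    Because both weights are derived from the scalar value alone, an unencoded position inside such a range gets a
    weight pair between those of its encoded neighbours, and no entry has to be added to the table for it.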

    > A. It would substantially increase the size of the default table
    > for *all* users, because it would assign primary weights for
    > all unassigned code points inside blocks -- code points which
    > now simply get implicit weights assigned by rule.

    Yes, I admit it.

    > B. The assumptions about better default behavior are erroneous,
    > because they presuppose things which are not necessarily true. In
    > particular, the main goal appears to be to assure well-behavedness
    > for future additions on a per-script basis, since primary weight
    > orders are relevant to scripts. However, several of the most important
    > scripts are now, for historical reasons, encoded in multiple
    > blocks. A rule which assigns default primary weights on a per
    > block basis for unassigned characters would serve no valid purpose
    > in such cases.

    You can perfectly well exclude positions that have been left unassigned in blocks only for compatibility reasons. We
    should know which they are (and in fact Unicode should then list them as permanently invalid characters). If it does
    not, it's because Unicode and ISO 10646 are still keeping the possibility of encoding new characters there, but this
    should only be for the relevant scripts for which these positions were left unallocated.

    > C. In addition to randomizing primary weight assignments for
    > scripts in the case of multiple-block scripts, such a rule would
    > also introduce *more* unpredictability in cases of the punctuation
    > and symbols which are scattered around among many, many blocks,
    > as well.

    No, it would not: by default, and as long as they are not encoded, they will sort within the script to which these
    blocks were allocated. You can perfectly well list all the relevant blocks that should be assigned weights together.

    > In general this proposal fails to understand that the default
    > weights for DUCET (as expressed in allkeys.txt) has absolutely
    > nothing whatsoever to do with block identities or block
    > ranges in the standard. The weighting algorithm knows absolutely
    > nothing about block values.

    Really? Yes, it depends more or less on the general category, but most additions in the existing blocks are for
    letters. Given that unassigned code points currently sort after all scripts, they already have an "unordered"
    position in collation. When they are encoded, they will have to move anyway to their final position. This proposal
    does not remove this possibility.

    > > The other categories above (for code units exceeding the range of
    > > valid scalar values if they are not treated as errors, or for code
    > > points with valid scalar values and assigned to non-characters if they
    > > are not treated as errors, or for code points with valid scalar values
    > > assigned or reserved in the special supplementary plane) can be kept
    > > as fully ignorable, using null weights on the (fully ignorable) first
    > > three levels, and the implicit (last level) weights for scalar value
    > > or code unit binary weights.
    >
    > Except that such treatment is not optimal for the noncharacters.
    > As noted in the review note in the proposed update for UTS #10,
    > noncharacters should probably be given implicit weights, rather
    > than being treated as ignorables by default. That is a proposed
    > change to the specification.

    Yes, I agree with you on this change of rule: they are permanently assigned, they have a meaning if they are ever
    used (even if it would create ill-formed sequences not representable in standard UTFs, and not interchangeable),
    and they can't be ignored silently.

    The best that can be done is to allow sorting them all at the end, with trailing weights, instead of skipping them
    silently, for easier identification.
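
    A minimal sketch of such trailing treatment (Python; trailing_key is a hypothetical name, and plain code point
    order stands in for real collation weights, which is an assumption made only to keep the example short):

```python
def is_noncharacter(cp):
    """True for the 66 noncharacters: U+FDD0..U+FDEF and every code point
    whose low 16 bits are FFFE or FFFF."""
    return 0xFDD0 <= cp <= 0xFDEF or (cp & 0xFFFE) == 0xFFFE


def trailing_key(text):
    """Key in which noncharacters sort after everything else instead of
    being skipped. Each element is (group, code point)."""
    key = []
    for ch in text:
        cp = ord(ch)
        group = 1 if is_noncharacter(cp) else 0  # 1 = trailing group
        key.append((group, cp))
    return key
```

    Sorting the strings "\uFDD0" and "\uFFFD" with key=trailing_key puts U+FFFD first, although U+FDD0 has the lower
    scalar value, so the noncharacter stays visible at the end of the list.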

    But anyway, given that such use would be purely local, applications are still free to handle them the way they want,
    according to their own local private conventions.

    The only interesting option that would be "portable" would be if one wanted to collate texts according to UTF-16
    code unit values, instead of Unicode scalar values, for the last binary weights appended to sort keys for
    semi-stability (this would be done for interoperability with legacy systems that still do not support
    supplementary planes, i.e. the deprecated implementation level 1 of ISO 10646), so that the supplementary
    characters actually encoded with surrogates (especially the supplementary sinograms, but also all the newly encoded
    historical scripts) would not be fully ignored.
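
    The difference between the two binary orders can be shown with a short sketch (Python; the helper names
    utf16_units and scalar_values are hypothetical):

```python
def utf16_units(text):
    """List of 16-bit code units of text in UTF-16."""
    data = text.encode("utf-16-be")
    return [int.from_bytes(data[i:i + 2], "big") for i in range(0, len(data), 2)]


def scalar_values(text):
    """List of Unicode scalar values of text."""
    return [ord(ch) for ch in text]


samples = ["\uFF5E", "\U00010000"]  # one BMP character, one supplementary
by_scalar = sorted(samples, key=scalar_values)
by_code_unit = sorted(samples, key=utf16_units)
```

    Here by_scalar keeps U+FF5E before U+10000, while by_code_unit reverses them, because U+10000 is encoded as the
    surrogate pair D800 DC00 and D800 is lower than FF5E.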


