Re: UTS#10 (UCA) 7.1.3 Implicit Weights, Unassigned and Other Code Points

From: Kenneth Whistler (kenw@sybase.com)
Date: Wed Aug 04 2010 - 19:20:59 CDT

    > > That statement is incorrect. The UCA currently specifies that
    > > ill-formed code unit sequences and *noncharacters* are mapped
    > > to [.0000.0000.0000.], but unassigned code points are not.
    >
    > This is exactly equivalent: if you use strength level 3, they are
    > both [.0000.0000.0000], ...

    You have missed the point entirely.

    In your original note in this thread you listed a whole series of
    types of code points *including* unassigned code points, and
    made the blanket claim that "they are treated as expansions to
    ignorable collation elements:"

    I was correcting your claim about unassigned code points, which are
    *not* "treated as expansions to ignorable collation elements."
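
    The difference can be sketched in a few lines of Python. This is only
    an illustration, not the full UCA key-building algorithm: the weights
    for the letter are hypothetical, the sort_key helper is a simplified
    stand-in for sort key formation, and the two implicit elements for
    U+0378 (assumed here to be unassigned) are written out by hand from
    the Section 7.1.3 rule.

```python
from typing import List, Tuple

# A collation element as (primary, secondary, tertiary) weights.
CE = Tuple[int, int, int]

# Completely ignorable element, as used for noncharacters in this discussion.
IGNORABLE: List[CE] = [(0x0000, 0x0000, 0x0000)]

# Implicit-weight elements for U+0378, assumed unassigned, base FBC0.
UNASSIGNED_0378: List[CE] = [(0xFBC0, 0x0020, 0x0002), (0x8378, 0x0000, 0x0000)]

# Hypothetical weights for some ordinary letter (not taken from allkeys.txt).
LETTER: List[CE] = [(0x15D0, 0x0020, 0x0002)]


def sort_key(elements: List[CE]) -> List[int]:
    """Simplified three-level key: nonzero weights per level, 0 as separator."""
    key: List[int] = []
    for level in range(3):
        if level > 0:
            key.append(0)  # level separator
        key.extend(ce[level] for ce in elements if ce[level] != 0)
    return key


# An ignorable element leaves the key unchanged at every level.
same = sort_key(LETTER + IGNORABLE) == sort_key(LETTER)

# Implicit weights add nonzero primaries, so the key changes at level 1.
differs = sort_key(LETTER + UNASSIGNED_0378) != sort_key(LETTER)

print(same, differs)  # prints: True True
```

    Under these assumptions the two cases are not equivalent at strength
    level 3 or at any other level, because the implicit primaries are
    nonzero.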

    > > Hmmm, if we want to be smarter, we should read what the actual
    > > specification says.
    >
    > That's what I did. If there's a contradiction for you, that's because
    > the specification is ambiguous on these points.

    UTS #10, Version 5.2.0, Section 7.1.2, Legal Code Points:

    "Any other legal code point that is not explicitly mentioned in the
    table is mapped [to] a sequence of two collation elements as described
    in Section 7.1.3, Implicit Weights."

    Since "any other legal code point" in this context clearly includes
    all unassigned code points, I fail to see any ambiguity here.
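
    As an illustration of what "a sequence of two collation elements"
    means in practice, here is a minimal Python sketch of the implicit
    weight derivation as I read Section 7.1.3 of UTS #10 5.2.0. The
    function name is hypothetical, only the FBC0 base is handled, and
    U+0378 is assumed to be an unassigned code point.

```python
def implicit_weights(cp: int) -> str:
    """Format the two implicit collation elements for a code point.

    Sketch only: uses the FBC0 base that applies to code points which are
    not Unified Ideographs, which covers unassigned code points.
    """
    if not 0 <= cp <= 0x10FFFF:
        raise ValueError("not a Unicode code point")
    base = 0xFBC0
    aaaa = base + (cp >> 15)           # first primary: base plus high bits
    bbbb = (cp & 0x7FFF) | 0x8000      # second primary: low 15 bits, top bit set
    return f"[.{aaaa:04X}.0020.0002.][.{bbbb:04X}.0000.0000.]"


print(implicit_weights(0x0378))  # prints: [.FBC0.0020.0002.][.8378.0000.0000.]
```

    Neither element is ignorable at the primary level, and no table entry
    is needed to produce them.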
     
    > I've read and re-read it many times before concluding that this
    > was NOT fully specified (and then permitted under my
    > interpretation).

    Hmmm.

    >
    > > > so that they with primary weights lower than those used for
    > > > characters in the same block, but still higher than encoded characters
    > > > from other blocks have that lower primary weights than assigned
    > > > characters in the block. Gaps should be provided in the DUCET at
    > > > the beginning of ranges for these blocks so that they can all fit
    > > > in them. The benefit being also that other blocks after them will
    > > > keep their collation elements stable and won't be affected by the
    > > > new allocations in one block.
    > >
    > > That particular way of assigning implicit weights for unassigned
    > > characters would be a complete mess to implement for the default
    > > table.
    >
    > Yes, I admit that it would create huge gaps everywhere, but it's not so
    > critical for sinograms, that are encoded in
    > a very compact way, with NO gap at all (given that they are assigned
    > primary weights algorithmically from their
    > scalar value). So mapping sinograms using the same scheme, even if
    > they are still not encoded but at least within
    > the assigned blocks or planes will make NO difference in the DUCET.

    Actually, your conclusion is exactly backwards. *Currently* all of the
    unassigned characters in the Unified Ideograph blocks are given
    implicit weights by the rules in Section 7.1.3 (all based on FBC0).

    If you applied your scheme to the Unified Ideograph blocks, then all
    of the unassigned ranges in those blocks would get explicit primary weights on
    a per-block basis -- and tracking those weights *would* make the tables bigger.
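
    A rough Python sketch of the current behavior, assuming the three
    bases given in Section 7.1.3 of UTS #10 5.2.0 (FB40, FB80, FBC0) as I
    recall them. The constant and function names are hypothetical, and
    U+9FFF is treated as unassigned, which I believe was the case for the
    Unicode version under discussion.

```python
# Bases for implicit primary weights (names are illustrative only).
BASE_CJK_UNIFIED = 0xFB40    # Unified ideographs in the CJK Unified/Compatibility blocks
BASE_OTHER_UNIFIED = 0xFB80  # Unified ideographs in other blocks (extensions)
BASE_OTHER = 0xFBC0          # Any other code point, including unassigned ones


def implicit_primaries(cp: int, base: int) -> tuple:
    """Return the pair of primary weights derived by rule from cp and base."""
    return (base + (cp >> 15), (cp & 0x7FFF) | 0x8000)


# U+4E00 is an assigned unified ideograph in the main CJK block.
assigned = implicit_primaries(0x4E00, BASE_CJK_UNIFIED)

# U+9FFF lies in the same block but is assumed unassigned here,
# so it takes the FBC0 base rather than FB40.
gap = implicit_primaries(0x9FFF, BASE_OTHER)

print([f"{w:04X}" for w in assigned])  # prints: ['FB40', 'CE00']
print([f"{w:04X}" for w in gap])       # prints: ['FBC1', '9FFF']
print(assigned < gap)                  # prints: True
```

    Both results come from the same rule with no per-code-point table
    data, and the unassigned position sorts after the assigned ideographs
    rather than among them.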

    > > A. It would substantially increase the size of the default table
    > > for *all* users, because it would assign primary weights for
    > > all unassigned code points inside blocks -- code points which
    > > now simply get implicit weights assigned by rule.
    >
    > Yes, I admit it.

    O.k., so we agree on something. ;-)

    > > B. The assumptions about better default behavior are erroneous,
    > > because they presuppose things which are not necessarily true. In
    > > particular, the main goal appears to be to assure well-behavedness
    > > for future additions on a per-script basis, since primary weight
    > > orders are relevant to scripts. However, several of the most important
    > > scripts are now, for historical reasons, encoded in multiple
    > > blocks. A rule which assigns default primary weights on a per
    > > block basis for unassigned characters would serve no valid purpose
    > > in such cases.
    >
    > You can perfectly exclude positions that have been left unassigned
    > in blocks only for compatibility reasons. We
    > should know which they are (and in fact Unicode should then list
    > them as permanently invalid characters). If it does
    > not, it's because Unicode and ISO 10646 are still keeping the possibility
    > of encoding new characters there, but this
    > should only be for the relevant scripts to which these positions were left unallocated.

    Neither the UTC nor WG2 would *ever* agree to marking some code points
    within blocks as "permanently invalid characters". There is no basis
    for doing so, whatever the original sources of a particular repertoire.
    There have been dozens of cases of later decisions to "fill in the gaps"
    in some block with later additions.

    And again, you seem to have totally missed my point. Latin characters
    that get primary weights in DUCET are now encoded in Unicode in at *least*
    eleven separate blocks. The primary weights for Latin characters
    are decided by an input file (in a *single* primary order) completely
    independently of those blocks. By your scheme, if you start assigning
    primary weights to gaps in blocks containing Latin characters, how
    do you assign them? What Latin block takes precedence? And how would assigning
    such gaps a set of primary weights, which would necessarily be unrelated to
    the primary order of the *assigned* characters -- which isn't based on
    block identity -- be any better than random, anyway?

    > > C. In addition to randomizing primary weight assignments for
    > > scripts in the case of multiple-block scripts, such a rule would
    > > also introduce *more* unpredictability in cases of the punctuation
    > > and symbols which are scattered around among many, many blocks,
    > > as well.
    >
    > No, it would not, by default and as long as they are not encoded, they will sort
    > within the script to which these
    > blocks were allocated. You can perfectly list all the relevant blocks
    > that should be assigned weights together.

    I wasn't talking about a *script*. I was talking about generic punctuation
    and symbols which are scattered around among many blocks -- unrelated
    to scripts.
     
    > > In general this proposal fails to understand that the default
    > > weights for DUCET (as expressed in allkeys.txt) have absolutely
    > > nothing whatsoever to do with block identities or block
    > > ranges in the standard. The weighting algorithm knows absolutely
    > > nothing about block values.
    >
    > Really ?

    Yes, really.

    > Yes it depends more or less on the general category, but most additions
    > in the existing blocks are for
    > letters. Given that they sort after all scripts, they already have
    > an "unordered" position in collation. When
    > they will be encoded, they will have to move anyway to their final
    > position. This proposal does not suppress this
    > possibility.

    Then what is the point? All this accomplishes -- as I stated above -- is
    to make the default table bigger, without improving the results one bit.

    > > Except that such treatment is not optimal for the noncharacters.
    > > As noted in the review note in the proposed update for UTS #10,
    > > noncharacters should probably be given implicit weights, rather
    > > than being treated as ignorables by default. That is a proposed
    > > change to the specification.
    >
    > Yes, I agree with you on this change of rule: they are permanently assigned,
    > they have a meaning if they are ever
    > used (even if it would create ill-formed sequences not representable in
    > standard UTF's, and not interchangeable),
    > and they can't be ignored silently.

    O.k., something else we can agree on. ;-)

    --Ken


