Re: CaseFirst and CaseLevel Tailorings of UCA and LDML from Ken Whistler on 2012-05-21 (Unicode Mail List Archive)

From: Ken Whistler <kenw_at_sybase.com>
Date: Mon, 21 May 2012 17:43:27 -0700

On 5/21/2012 4:37 PM, Richard Wordingham wrote:
> Again, even the interpretation of uppercase in terms of weights is not
> certain, for the ISO/IEC 14651:2007 example of a tailoring for
> uppercase first does not adjust the collation elements with a tertiary
> weight of 1C, although they are listed as uppercase in Section 7.2 of
> the UCA and the standard collation table of ISO/IEC 14651 calls the
> weight MISCCAP.

Don't expect any ISO/IEC 14651:2007 example to be worked out in excruciating
detail for all the edge cases. That is what UCA, LDML, and ICU are for.

>
> There are a few out and out anomalies in the tertiary weights of
> primary non-ignorables, even allowing for intelligent hand correction
> of the decompositions in UnicodeData.txt. Is a compliant
> implementation free to classify as lowercase or uppercase dependent on
> the appearance? For example, when caseFirst is set to uppercase, ICU
> orders U+1D34 MODIFIER LETTER CAPITAL H before U+0068 LATIN SMALL
> LETTER H, but anomalously orders U+A7F8 MODIFIER LETTER CAPITAL
> H WITH STROKE *after* U+0127 LATIN SMALL LETTER H WITH STROKE because
> the latter's tertiary weight identifies it as <super> with no entry for
> 'Case or kana subtype' class. Is this behaviour required by the UCA +
> DUCET?

Well, that may be a bug in allkeys.txt.

0068 ; [.1699.0020.0002.0068] # LATIN SMALL LETTER H
0048 ; [.1699.0020.0008.0048] # LATIN CAPITAL LETTER H
02B0 ; [.1699.0020.0014.02B0] # MODIFIER LETTER SMALL H
2095 ; [.1699.0020.0015.2095] # LATIN SUBSCRIPT SMALL LETTER H
1D34 ; [.1699.0020.*001D*.1D34] # MODIFIER LETTER CAPITAL H
1F137 ; [.1699.0020.*001D*.1F137] # SQUARED LATIN CAPITAL LETTER H
0127 ; [.1699.0020.0002.0068][.0000.007D.0002.0335] # LATIN SMALL LETTER H WITH STROKE
210F ; [.1699.0020.0002.210F][.0000.007D.0002.210F] # PLANCK CONSTANT OVER TWO PI
0126 ; [.1699.0020.0008.0048][.0000.007D.0002.0335] # LATIN CAPITAL LETTER H WITH STROKE
A7F8 ; [.1699.0020.*0014*.A7F8][.0000.007D.*0014*.A7F8] # MODIFIER LETTER CAPITAL H WITH STROKE

The 0014 gives U+A7F8 a <super> tertiary weight, but doesn't also mark
it as uppercase. The tertiary weight 001D gives a character a <super> (or
<sub> or <square>) tertiary weight and also marks it as uppercase. The
default tertiary weights aren't completely separated into all the possible
combinations here, because the required weighting space gets out of hand,
and that seems unnecessary for the edge cases for compatibility characters,
at least for *default* weighting of such.
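
To make the distinction concrete, here is a minimal Python sketch that reads
entries in the allkeys.txt layout quoted above and reports whether the first
collation element carries an uppercase tertiary weight. The function names
and the regular expression are illustrative assumptions, not part of the UCA
or of ICU, and the set of uppercase weights is limited to the two values
named in this message (0008 and 001D); the full table is in Section 7.2 of
the UCA.

```python
import re

# Tertiary weights that this message describes as marking uppercase.
# Assumption: only the two values discussed here; UCA Section 7.2 lists more.
UPPERCASE_TERTIARIES = {0x0008, 0x001D}

# One collation element: [.pppp.ssss.tttt.cccc] (or [*...] for variable).
# The fourth field can be longer than four hex digits, e.g. 1F137.
ELEMENT = re.compile(
    r"\[[.*]([0-9A-F]{4})\.([0-9A-F]{4})\.([0-9A-F]{4})\.[0-9A-F]{4,6}\]"
)


def parse_entry(line):
    """Return (code point, [(primary, secondary, tertiary), ...])."""
    code, rest = line.split(";", 1)
    elements = [
        tuple(int(weight, 16) for weight in match.groups())
        for match in ELEMENT.finditer(rest)
    ]
    return int(code.strip(), 16), elements


def marks_uppercase(line):
    """True if the first collation element has an uppercase tertiary."""
    _, elements = parse_entry(line)
    return elements[0][2] in UPPERCASE_TERTIARIES


# Entries copied from the excerpt above, without the emphasis asterisks.
H_1D34 = "1D34 ; [.1699.0020.001D.1D34] # MODIFIER LETTER CAPITAL H"
H_A7F8 = (
    "A7F8 ; [.1699.0020.0014.A7F8][.0000.007D.0014.A7F8] "
    "# MODIFIER LETTER CAPITAL H WITH STROKE"
)

if __name__ == "__main__":
    for entry in (H_1D34, H_A7F8):
        code, _ = parse_entry(entry)
        print(f"U+{code:04X} uppercase tertiary: {marks_uppercase(entry)}")
```

With these two entries, U+1D34 is reported as uppercase and U+A7F8 is not,
which is the asymmetry under discussion.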

But this *particular* modifier letter (U+A7F8) would behave better for
case tailoring if it were weighted by default with 001D as the tertiary.

Part of the problem here is that the whole class of h's with strokes is
forced to have *secondary* weights based on U+0335, even though they don't
have decompositions in UnicodeData.txt. (That was by an explicit UTC
decision some time back.) And the initial weight assignment algorithm is
apparently interacting with that forced generation of secondary weight in
a way which doesn't detect that U+A7F8 would better default to tertiary
001D instead of tertiary 0014.

This *could* be fixed by special-case patching the initial weight
assignment algorithm, but that is the kind of jiggering that seems
ill-advised for these kinds of down-in-the-noise edge cases.

Consider who would *really* care about the tertiary uppercase-first
tailoring of U+A7F8. It is an extremely rare character, used only in UPA,
and not even in core UPA, but in extensions for it. It is not known to
occur in any orthography, but only in specialized phonetic transcriptions
of only *some* languages. Even *then*, the tertiary order could only make
a difference in the ordering of strings including this character, and UPA
strings don't have case differences, because UPA is a technical phonetic
transcription, not ordinary cased text. All the other primary and
secondary differences would overwhelm the tertiary distinctions for almost
all data, anyway.

If even in *those* circumstances somebody required uppercase-first
tailoring to work without exception for U+A7F8, well, then the solution
for that is simply to tailor the default tertiary weight from 0014 to 001D.
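
The effect of that tailoring can be sketched in Python as follows. This is
a toy model under stated assumptions, not how ICU or any conformant
implementation works: the weight table holds only three entries copied from
the excerpt above, uppercase-first is modeled by ranking tertiary weights
0008 and 001D ahead of all others, and the tailoring changes only the first
collation element of U+A7F8 (mirroring how U+0126 carries 0008 only in its
first element). All names are hypothetical.

```python
# Tertiary weights that this message describes as marking uppercase.
UPPERCASE_TERTIARIES = {0x0008, 0x001D}

# (primary, secondary, tertiary) per collation element, from the excerpt.
DEFAULT_WEIGHTS = {
    "\u0127": [(0x1699, 0x0020, 0x0002), (0x0000, 0x007D, 0x0002)],
    "\u0126": [(0x1699, 0x0020, 0x0008), (0x0000, 0x007D, 0x0002)],
    "\uA7F8": [(0x1699, 0x0020, 0x0014), (0x0000, 0x007D, 0x0014)],
}


def tailor_a7f8(table):
    """Copy the table, changing U+A7F8's first tertiary from 0014 to 001D."""
    tailored = dict(table)
    first, *rest = table["\uA7F8"]
    tailored["\uA7F8"] = [(first[0], first[1], 0x001D), *rest]
    return tailored


def upper_first_key(char, table):
    """Three-level key for one character; zero weights are skipped."""
    elements = table[char]
    primary = [p for p, _, _ in elements if p]
    secondary = [s for _, s, _ in elements if s]
    # Toy uppercase-first rule: uppercase tertiaries rank before the rest.
    tertiary = [
        (0 if t in UPPERCASE_TERTIARIES else 1, t)
        for _, _, t in elements
        if t
    ]
    return (primary, secondary, tertiary)


def upper_first_order(table):
    """Sort the table's characters with the toy uppercase-first key."""
    return sorted(table, key=lambda char: upper_first_key(char, table))


if __name__ == "__main__":
    for label, table in (
        ("default", DEFAULT_WEIGHTS),
        ("tailored", tailor_a7f8(DEFAULT_WEIGHTS)),
    ):
        order = " ".join(f"U+{ord(c):04X}" for c in upper_first_order(table))
        print(f"{label}: {order}")
```

In this model the default table puts U+A7F8 after U+0127, and the tailored
table moves it between U+0126 and U+0127.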

--Ken
Received on Mon May 21 2012 - 19:45:37 CDT
