Re: Proposed Update UAXes for Unicode 6.1 from Philippe Verdy on 2011-07-08 (Unicode Mail List Archive)

From: Philippe Verdy <verdy_p_at_wanadoo.fr>
Date: Fri, 8 Jul 2011 19:26:40 +0200

This is not strictly related to this Unicode version update,
but I have an interesting question about the Unicode Stability Policy.

Summary: How does it apply to the exact value (or aliases) of the
property "Decomposition Type" (dt), for compatibility decomposition
mappings?

In the strict definition applicable to Unicode 4.1+, the stability of
decompositions is defined in terms of idempotent normalizations across
versions of strings containing only characters that are assigned and
encoded in each version, so that each character's decomposition
mapping (i.e. the list of code points to which that character is
mapped) should be
stable.

With the weaker definition of Unicode 3.1+, this list of code points
could change (but this was fixed later so that this mapping became
normalized with NFD), and it was permitted to fix some errors (under
some very limited conditions, and as exhibited in
NormalizationCorrections.txt, which lists the corrigenda of Unicode
3.2 and 4.0).

But the weaker definition just speaks about a much simpler (reduced)
decomposition type, i.e. only "canonical" or "compatibility". If I
look precisely at the possible distinct values for the "dt" property,
this weaker stability would still apply a strict stability only to the
property values (and aliases, as defined in PropertyValueAliases.txt):

dt ; Can ; Canonical ; can
dt ; None ; none

But all the other values have to be interpreted as "compatibility" for
the purpose of effectively implementing the four standard
normalizations (NFC, NFD, NFKC, NFKD), i.e. where the short value for
the "dt" property is any one of:

dt ; Com ; Compat ; com
dt ; Enc ; Circle ; enc
dt ; Fin ; Final ; fin
dt ; Font ; font
dt ; Fra ; Fraction ; fra
dt ; Init ; Initial ; init
dt ; Iso ; Isolated ; iso
dt ; Med ; Medial ; med
dt ; Nar ; Narrow ; nar
dt ; Nb ; Nobreak ; nb
dt ; Sml ; Small ; sml
dt ; Sqr ; Square ; sqr
dt ; Sub ; sub
dt ; Sup ; Super ; sup
dt ; Vert ; Vertical ; vert
dt ; Wide ; wide
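
To make the grouping above concrete, here is a minimal sketch (not
taken from any UCD tool) that loads the alias lines exactly as quoted
in this message and collapses any "dt" value to one of the three
classes that matter for normalization. The names build_alias_table and
normalization_class are hypothetical, and the matching is simply
case-insensitive, which is a simplification of the loose matching
rules for property values.

```python
# Alias lines copied from the lists quoted in this message.
ALIAS_LINES = """\
dt ; Can ; Canonical ; can
dt ; None ; none
dt ; Com ; Compat ; com
dt ; Enc ; Circle ; enc
dt ; Fin ; Final ; fin
dt ; Font ; font
dt ; Fra ; Fraction ; fra
dt ; Init ; Initial ; init
dt ; Iso ; Isolated ; iso
dt ; Med ; Medial ; med
dt ; Nar ; Narrow ; nar
dt ; Nb ; Nobreak ; nb
dt ; Sml ; Small ; sml
dt ; Sqr ; Square ; sqr
dt ; Sub ; sub
dt ; Sup ; Super ; sup
dt ; Vert ; Vertical ; vert
dt ; Wide ; wide
"""


def build_alias_table(text):
    """Map every alias (lower-cased) to the short value in field 2."""
    table = {}
    for line in text.splitlines():
        fields = [f.strip() for f in line.split(";")]
        if len(fields) < 3 or fields[0] != "dt":
            continue
        short = fields[1]
        for name in fields[1:]:
            table[name.lower()] = short
    return table


def normalization_class(value, table):
    """Collapse a dt value to 'none', 'canonical' or 'compatibility'."""
    short = table[value.lower()]
    if short == "None":
        return "none"
    if short == "Can":
        return "canonical"
    # Every remaining value behaves as a compatibility decomposition.
    return "compatibility"


TABLE = build_alias_table(ALIAS_LINES)
```

For example, normalization_class("Nobreak", TABLE) and
normalization_class("sqr", TABLE) both give "compatibility", which is
all that NFKC and NFKD need to know.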

Is this list of compatibility decomposition types subject to the
stability policy? (Yes, new aliases may be added in implementations,
as long as they preserve them in the same classes of equivalence.) But
could there be new compatibility decomposition types (still preserving
their uniqueness)?

And can these types change (for example from "dt=Small" to
"dt=Narrow", or from "dt=Nobreak" to "dt=Compat")?

I've looked closely at the definitions of other derived properties,
and it does not seem that the "dt" property is used for anything other
than implementing the normalizations (for example the word-breaking
properties do not depend on "dt=nb").

And it may eventually be convenient to have some characters with
compatibility decomposition mappings changed to exhibit better
decomposition mapping types (only to one of the existing values,
excluding possible future distinct values, as needed for the stability
rule "Property Alias Uniqueness" in Unicode 3.2+). Such a change would
not break the idempotency of normalizations defined for Unicode 4.1+,
or even the weaker definition for Unicode 3.1+.
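
As a rough illustration of that claim, the sketch below uses Python's
standard unicodedata module (which reflects whatever Unicode version
the interpreter bundles, not necessarily 6.1) to check that the four
normalization forms are idempotent on a handful of characters carrying
different compatibility tags. The SAMPLES table and the helper names
are my own choices for the example. The normalize() call only receives
the form name and the text, so the specific tag never enters the
computation.

```python
import unicodedata

# One sample character per compatibility tag (tag spelling as used in
# UnicodeData.txt). The selection is arbitrary.
SAMPLES = {
    "noBreak": "\u00a0",   # NO-BREAK SPACE
    "compat": "\ufb01",    # LATIN SMALL LIGATURE FI
    "circle": "\u2460",    # CIRCLED DIGIT ONE
    "wide": "\uff21",      # FULLWIDTH LATIN CAPITAL LETTER A
    "fraction": "\u00bd",  # VULGAR FRACTION ONE HALF
}

FORMS = ("NFC", "NFD", "NFKC", "NFKD")


def is_idempotent(form, text):
    """True if normalizing twice gives the same result as once."""
    once = unicodedata.normalize(form, text)
    return unicodedata.normalize(form, once) == once


def check_samples():
    """Return {tag: True/False} for idempotency over all four forms."""
    return {
        tag: all(is_idempotent(form, ch) for form in FORMS)
        for tag, ch in SAMPLES.items()
    }


if __name__ == "__main__":
    for tag, ok in check_samples().items():
        print(tag, unicodedata.decomposition(SAMPLES[tag]), ok)
```

Relabelling, say, U+00A0 from "noBreak" to "compat" would leave every
result of this check unchanged, because only the canonical versus
compatibility distinction and the mapping itself are consulted.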

The strict rule for Unicode 4.1+ just says: "Decomposition Mapping:
Once a character is assigned, its decomposition mapping will not
change." But I wonder if this applies to the exact decomposition type
as spelled out just below that, in the weaker definition, because it
just speaks about the value of the "decomposition mapping" property,
which does not contain itself the value of the "decomposition type"
property.

Even in the proposed update for TR44, the "Decomposition_Type" and
"Decomposition_Mapping" properties are defined separately (the first
one as an enumeration of property values listed in
PropertyValueAliases, the second one as a string made of code points
only).
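
To show how the two properties come apart in practice, here is a small
sketch that splits the decomposition field of UnicodeData.txt (the
same string that Python's unicodedata.decomposition() returns) into a
type and a mapping. The function name split_decomposition and the
labels "canonical" and "none" for untagged and empty fields are my own
conventions for the example, not names taken from the UCD.

```python
import unicodedata


def split_decomposition(field):
    """Split a decomposition field into (type, mapping).

    type    : the tag between angle brackets, or "canonical" when the
              field has no tag, or "none" when the field is empty.
    mapping : a tuple of code points (integers), possibly empty.
    """
    parts = field.split()
    if not parts:
        return ("none", ())
    if parts[0].startswith("<") and parts[0].endswith(">"):
        tag = parts[0][1:-1]
        parts = parts[1:]
    else:
        tag = "canonical"
    return (tag, tuple(int(p, 16) for p in parts))


if __name__ == "__main__":
    for ch in ("\u00a0", "\u00c5", "A"):
        field = unicodedata.decomposition(ch)
        print("U+%04X" % ord(ch), split_decomposition(field))
```

The second element is the only part covered by the wording
"decomposition mapping will not change"; the first element is stored
alongside it but is a separate value.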

If the large enumeration is in fact very weak (and not even needed for
guaranteeing the normalization idempotency) then we could as well
simplify it in the UCD to contain only "Can" (Canonical), "Com"
(Compatibility) and "None".

But we could just as well do the reverse, by further refining the
list of compatibility types between the "<" and ">" brackets in the
main UnicodeData.txt file. And maybe we could even allow multiple
values there (except "Can" and "None"), but I fear that this could
break some existing UCD parsers that only expect letters between these
angle brackets to detect compatibility values, without even having to
check which value is specified between them, using a simple regexp
like /(<[A-Za-z]+> )/.
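
The parser concern can be sketched as follows. The regular expression
is the one quoted above; the multi-value field "<noBreak,narrow> 0020"
is purely hypothetical syntax invented for this example (nothing of
the kind exists in UnicodeData.txt), and both function names are
illustrative.

```python
import re

# The simple pattern quoted in the message: letters only between the
# angle brackets, followed by a space.
LEGACY_TAG = re.compile(r"(<[A-Za-z]+> )")


def legacy_is_compat(field):
    """Detect a compatibility tag the way a letters-only parser would."""
    return LEGACY_TAG.search(field) is not None


def tolerant_is_compat(field):
    """Detect a tag by its opening bracket only, ignoring its content."""
    return field.lstrip().startswith("<")


if __name__ == "__main__":
    current = "<noBreak> 0020"            # present-day syntax
    invented = "<noBreak,narrow> 0020"    # HYPOTHETICAL multi-value tag
    for field in (current, invented):
        print(field, legacy_is_compat(field), tolerant_is_compat(field))
```

With the letters-only pattern, the invented field is no longer
recognized as a compatibility decomposition, so such a parser would
either treat it as canonical or fail when reading the code points.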

-- Philippe.
Received on Fri Jul 08 2011 - 12:29:37 CDT