L2/06-210

-------------------------------------------------------------------------------
Analyzing the Proposed Solutions re: Issue about Named Character Sequences

Asmus Freytag (May 6, 2006)
 

Thanks to Ken for laying out five options in handling the
recently discovered issues with named sequences. His analysis
focused on possible changes to the text of the standard. In my
opinion this overlooks the fact that, as with the encoding of
characters, there is another level, which in WG2 is somewhat
formalized as 'Principles and Procedures': the rules that the
committee follows in deciding what to add (or, in this case, what to
name). Unlike formal restrictions, principles allow a judgment on a
case-by-case basis, while presenting a strong bias in favor of, or
against, particular outcomes.

Here's my evaluation of the five options:

On 5/5/2006 6:28 PM, Kenneth Whistler wrote:
> The issue is this:
>
> It is possible to have multiple different combining
> character sequences that are canonically equivalent,
> and which should render the same and which to an
> end user represent the "same" character.
>
> As it stands now, UAX #34, Unicode Named Character
> Sequences, spells out the definition and syntax for
> specifying named character sequences, but is silent
> on the issue of how canonically equivalent sequences
> should be handled in that context.
>
> The namespace uniqueness requirements for character
> names, formal name aliases, and named character sequences
> would prevent the option of having one particular
> name being reused for canonically equivalent sequences,
> unless that is defined in some special way to prevent name
> clashes.
>
> ...
> Here are the alternatives as I see them.
>
> A. Status Quo
>
> No restrictions on standardizing distinct named character
> sequences for canonically equivalent sequences.
>
> Require that each such named sequence have a distinct
> name.
>
>
For the purposes of UAX#34, I am in favor of this option (A).

Sequences that consist of different character codes are different (as
sequences), even if they are equivalent on the level of what abstract
character they represent. UAX#34 is about naming sequences. I see little
gained by re-targeting it so it names equivalence classes instead.
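To make the distinction concrete, here is a minimal sketch using Python's
standard unicodedata module. The two sequences are chosen only for
illustration; the helper names are not from UAX#34 or 10646.

```python
import unicodedata

# Two different code point sequences for the same abstract character.
composed = "\u012F\u0307\u0301"          # <012F, 0307, 0301>
decomposed = "\u0069\u0328\u0307\u0301"  # <0069, 0328, 0307, 0301>

def code_points(sequence):
    """List the code points of a string as four-digit (or longer) hex."""
    return [f"{ord(ch):04X}" for ch in sequence]

def canonically_equivalent(a, b):
    """Two strings are canonically equivalent if their NFD forms match."""
    return unicodedata.normalize("NFD", a) == unicodedata.normalize("NFD", b)

# Different as sequences, yet members of one equivalence class.
different_as_sequences = code_points(composed) != code_points(decomposed)
same_equivalence_class = canonically_equivalent(composed, decomposed)
```

Naming sequences means naming the first kind of object (the code point
lists); naming equivalence classes would mean naming the second.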

However, as a matter of working policy of the encoding committees, I
favor an approach where the committees refuse by default to give a
*standard* name to more than one sequence in each equivalence class.

In addition to agreeing on the principle that we prefer to name only
sequences that are in NFC, I propose that we address the issue of
canonical equivalence where it impacts us most, and that is in the
definition of *Collections*, which are intended to enumerate
sub-repertoires in 10646. Here we clearly intend to refer to the
underlying entities, not to their encoded forms.

And unlike the definition of sequences and named sequences there's
no precedent and no existing definitions and other normative language
that would have to be revised. More on that below.

My preferred approach looks superficially like Option C:

> C. Restrict to any one equivalent sequence
>
> Require that only one of any set of canonically equivalent
> sequences be standardized, but not require that that one
> necessarily be NFC.
>
> This option would also automatically prevent the need to
> clone names for canonically equivalent sequences.
>
>
but the important difference is where the limit is established. If it is
in the principles and procedures, then it allows the UTC and WG2 to deal
with exceptional circumstances, should they arise, which has obvious
advantages. We've all experienced the contortions that the committee has
gone through whenever we've frozen some aspect of the standard only to
later encounter an exception.

It also gives us a graceful way to handle the few non-NFC sequences
that we have named today. On a case-by-case basis, we can allow named
NFC equivalents, and henceforth adopt the principle that, by default,
we accept new names only for sequences that are already in NFC.

There is an important difference between doing this by principle or
incorporating it inflexibly into the language of the standard as in
Option B:
> B. Restrict to NFC
>
> Require that *only* sequences in NFC be standardized as
> a named character sequence.
>
> This would automatically prevent the need to clone names
> for canonically equivalent sequences.
>
>
We cannot apply Option B retroactively, as the *current* named sequences
are not all in NFC. We would have to re-state each existing non-NFC
sequence (which is something that we should not do) or we would need to
grandfather it.
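Finding the entries that would need grandfathering is mechanical. The
following is a rough sketch, assuming each entry is available as a name
plus a space-separated list of hex code points; the names and sample
entries below are hypothetical and are not taken from the actual data
file.

```python
import unicodedata

def parse_code_points(field):
    """Turn a field such as '0069 0328' into the corresponding string."""
    return "".join(chr(int(cp, 16)) for cp in field.split())

def format_code_points(sequence):
    """Inverse of parse_code_points."""
    return " ".join(f"{ord(ch):04X}" for ch in sequence)

def non_nfc_entries(entries):
    """Return (name, code points, NFC code points) for entries not in NFC."""
    result = []
    for name, field in entries:
        sequence = parse_code_points(field)
        nfc = unicodedata.normalize("NFC", sequence)
        if nfc != sequence:
            result.append((name, field, format_code_points(nfc)))
    return result

# Hypothetical entries, for illustration only.
sample = [
    ("HYPOTHETICAL SEQUENCE ONE", "012F 0307 0301"),
    ("HYPOTHETICAL SEQUENCE TWO", "0069 0328 0307 0301"),
]
to_grandfather = non_nfc_entries(sample)
```

Only the second sample entry is reported, together with its NFC
equivalent, which is the sequence in the first entry.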

More importantly, it would forever prevent us from giving a name to a
non-normalized sequence. Why would we want that option? Just one
example: there are cases when the typing order for characters is a
non-normalized sequence. I see no benefit in making it impossible for us
to ever give a name to such a typed-in sequence.

The rationale for naming sequences is to address the need to give a
handle to things that are not already encoded atomically. I am strongly
convinced that we cannot decide a priori what situations may arise in
which we might wish to have a name for one or another specific
sequence. Adoption of a *restriction* ties our hands unnecessarily in
this regard. Adoption of a *principle* achieves the end of avoiding
needless clutter in the list, while preserving essential flexibility.
> D. Standardize *all* canonically equivalent sequences as a set
>
> Modify the syntax of NamesSequences.txt (if necessary), and
> verify that when a named sequence is added, *all* canonically
> equivalent sequences are associated with the same name.
>
> This option also automatically prevents name cloning.
>
This goes even further away from naming sequences towards naming
equivalence classes. I think this is the wrong direction to go.

Further, it requires substantial change in file syntax and normative
language. Because this is an issue that is shared between 10646 and
Unicode, we would need to change the normative language and file format
in both standards.

Overkill and too destabilizing.
> E. Allow identical names for canonical equivalents
>
> This option would relax the namespace uniqueness requirement
> in that it would stipulate that any canonically equivalent
> sequence could be made a named sequence, but it would be
> required to have the same name as a prior standardized
> named character sequence.
>
> This option explicitly allows *identical* names for
> canonical equivalent sequences, and would prohibit the
> kind of distinct naming listed above for #1 - #4.
>
>
From my perspective, this is the worst option. Verifying that names are
unique currently relies on making sure that there is only one instance
of each name in UnicodeData.txt, NamedSequences.txt,
NamedSequencesProv.txt, the working copy of the Namelist for the pending
version, and the working copy of any additions to the list of named
sequences. Having to juggle five files (in different formats) in order
to keep from violating name uniqueness is bad enough, but having
to parse and normalize all the sequences on top of it just adds to the
problem. Yes, it can be done, but the more complicated it is, the fewer
people will be able to run independent checks.
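The difference in effort can be sketched as follows. This is an
illustration only, not the tooling actually used: it assumes the names
(and, for Option E, the sequences) have already been extracted from the
various files into Python lists, and all names and data are
hypothetical.

```python
import unicodedata
from collections import defaultdict

def duplicate_names(sources):
    """Today's check: report every name that occurs more than once.
    `sources` maps a source label to the list of names found there."""
    seen = defaultdict(list)
    for label, names in sources.items():
        for name in names:
            seen[name].append(label)
    return {name: labels for name, labels in seen.items() if len(labels) > 1}

def option_e_conflicts(named_sequences):
    """Option E check on (name, sequence) pairs. A name may repeat only
    across canonically equivalent sequences, and equivalent sequences
    must share one name, so every sequence has to be normalized."""
    forms_by_name = defaultdict(set)
    names_by_form = defaultdict(set)
    for name, sequence in named_sequences:
        form = unicodedata.normalize("NFD", sequence)
        forms_by_name[name].add(form)
        names_by_form[form].add(name)
    reused = sorted(n for n, forms in forms_by_name.items() if len(forms) > 1)
    split = sorted(sorted(ns) for ns in names_by_form.values() if len(ns) > 1)
    return reused, split

# Hypothetical data, for illustration only.
sources = {"characters": ["NAME A", "NAME B"], "sequences": ["NAME B", "NAME C"]}
entries = [
    ("HYPOTHETICAL ONE", "\u012F\u0307\u0301"),
    ("HYPOTHETICAL ONE", "i\u0328\u0307\u0301"),   # equivalent: allowed
    ("HYPOTHETICAL TWO", "q\u0301"),
    ("HYPOTHETICAL TWO", "q\u0300"),               # not equivalent: name reused
    ("HYPOTHETICAL THREE", "\u012F\u0307\u0300"),
    ("HYPOTHETICAL FOUR", "i\u0328\u0307\u0300"),  # equivalent, names differ
]
duplicates = duplicate_names(sources)
reused, split = option_e_conflicts(entries)
```

The first function needs nothing but the names. The second needs parsed
sequences and normalization data for the correct Unicode version, which
is the extra burden described above.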

I strongly counsel staying away from this approach.


-------------------------------------------------------------------
Background
=========

Here are the relevant citations from 10646:

Clause 29 clearly states:

  A named UCS Sequence Identifier (USI) is a USI associated to a name

Clause 6.6 clearly states:


   ISO/IEC 10646 defines an identifier for any sequence of code
   positions taken from the standard. Such an identifier is known as a
   UCS Sequence Identifier (USI). For a sequence of n code positions it
   has the following form:


               <UID1, UID2, ..., UIDn>

   ...The UCS Sequence Identifier shall include at least two UIDs;

That very clearly makes the named entity the particular sequence, and
*not* the abstract character behind it.
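As a small illustration of that point, the sketch below renders a
sequence in the USI form quoted above. It assumes that each UID is
written as a bare hexadecimal code point of at least four digits, which
is only one of the short identifier forms; the function name is
hypothetical.

```python
def format_usi(sequence):
    """Render a string as <UID1, UID2, ..., UIDn>.
    Assumption: each UID is shown as bare hex, at least four digits."""
    if len(sequence) < 2:
        raise ValueError("a USI shall include at least two UIDs")
    return "<" + ", ".join(f"{ord(ch):04X}" for ch in sequence) + ">"

# Canonically equivalent sequences still yield distinct identifiers.
usi_a = format_usi("\u012F\u0307\u0301")
usi_b = format_usi("\u0069\u0328\u0307\u0301")
```

Because the identifier is built from the code positions themselves, two
canonically equivalent sequences have two different USIs.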

These clauses correspond to the definition in UAX#34: Unicode Named Character
Sequences:

   D1  Unicode named character sequence: A specific sequence of two or
       more Unicode characters, together with a formal name designating
       that sequence.

Again, it is clear that the sequence is being named, not what it stands for.

Finally, our stability policy states:

 
    Named character sequences will not be changed or removed. 
    This stability guarantee applies both to the name of the named 
    character sequence and to the sequence of characters so named.
