Re: Rationale wanted for Unicode identifier rules

From: Tex Texin
Date: Sat Mar 04 2000 - 23:27:54 EST

On successive upgrades, #2-

No language can guarantee downward compatibility (without
remaining static itself).

As long as character properties, once defined, do not change, then
at least XML can have upward compatibility, so something defined
with XML 1.0 should continue to work with successive upgrades.

I don't see why it would be unacceptable to have the scenario
where XML 1.0 files continue to work, but something that
takes advantage of functionality in a later version requires
that version or later. It's pretty much the way of the world.

There is perhaps one other scenario-
If there is a way for an XML file to optionally carry a definition of
character properties with it, then it can be downward compatible.
Of course you wouldn't want to define all characters, maybe just those
defined later than some version. Then the file could be parsed
by versions down to whatever version was desired.

(Perhaps you would want to validate that no character already defined
in the parser is having its properties changed or overridden.
I am not sure yet whether this is needed.)

It would mean a parser would have to be able to append to its
character property table for the duration of the processing of the
file and then return to its original state for the next file.
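
A minimal sketch of that append-and-restore behaviour, in Python. The
names CharPropertyTable and file_scope, and the sample code points, are
assumptions made up for illustration; they are not part of XML or of any
existing parser. The sketch also includes the optional check that a file
does not override a property the parser already knows.

```python
from contextlib import contextmanager


class CharPropertyTable:
    """Built-in code point -> general category table with a per-file overlay."""

    def __init__(self, base):
        self._base = dict(base)   # properties compiled into the parser
        self._overlay = {}        # properties carried by the current file

    def category(self, cp):
        # Built-in definitions always win; the overlay only fills gaps.
        if cp in self._base:
            return self._base[cp]
        return self._overlay.get(cp, "Cn")  # unknown -> unassigned

    @contextmanager
    def file_scope(self, declared):
        # Refuse a file that tries to change a property already defined.
        for cp, cat in declared.items():
            if cp in self._base and self._base[cp] != cat:
                raise ValueError(
                    f"U+{cp:04X} is already defined as {self._base[cp]}"
                )
        self._overlay = dict(declared)
        try:
            yield self
        finally:
            # Return to the original state before the next file.
            self._overlay = {}


# Hypothetical parser that only knows two letters.
table = CharPropertyTable({0x0041: "Lu", 0x0061: "Ll"})

# 0x0220 stands in for a character newer than the parser's tables.
with table.file_scope({0x0220: "Lu"}) as scoped:
    inside = scoped.category(0x0220)

after = table.category(0x0220)
```

Here `inside` is the category the file declared, and `after` falls back
to unassigned once the file has been processed.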


Mark Davis wrote:
> In general, I agree with the discussion here: identifiers should be chosen on the basis of character properties. As new characters are assigned, they are given appropriate properties, and the class of possible identifiers grows. There are, however, difficulties with this approach in certain contexts.
> Take, for example, XML identifiers. The difference in this case is that the identifiers occur in structured data, not program text. This data will live for years. The conformance requirements for XML identifiers are very strict. This is absolutely correct, since it guarantees compatibility around the world. But what this means is that the current, conformant XML parsers cannot accept new Unicode 3.0 letters in identifiers. There are a few main approaches to identifiers in XML, listed below.
> [One note that is relevant to all of these: while <identifier_extend> includes Cf, these characters should be filtered when composing identifiers, so there are actually 4 relevant categories for parsing identifiers. However, there are reserved blocks (2060-206F and E0000-E1000) now for Cf characters, so they should not present a problem.]
> 1. Status quo.
> Never accept characters outside of Unicode 2.0 in identifiers. Downside: new scripts, and additions (e.g. CJK ideographs) to existing scripts are disadvantaged -- forever.
> <identifier_start> := Lu, Ll, Lt, Lm, Lo, Nl -- as of Unicode 2.0
> <identifier_extend> := Mn, Mc, Nd, Pc, Cf -- as of Unicode 2.0
> <excluded> := Me, No, Zs, Zl, Zp, Cc, Pd, Ps, Pe, Pi, Pf, Po, Sm, Sc, Sk, So -- as of Unicode 2.0
> 2. Successive upgrades.
> Revise XML with each version of Unicode. This means you will have XML 1.0-compliant parsers, XML 1.1-compliant parsers, etc. Disadvantage: it takes years for compliant parsers to be fully spread across the world. During that time, data interchange between different versions of parsers cannot be guaranteed. I believe this will be unacceptable to the XML community.
> 3. Open Season.
> Define identifiers to be *fixed* as of Unicode 3.0, but to also include unassigned characters (Cn) as of that version.
> Identifiers are thus fixed for all time. They include all new letters that will be defined. Disadvantage: they will also include new punctuation, symbols, etc. defined post-3.0.
> <identifier_start> := Lu, Ll, Lt, Lm, Lo, Nl, AND Cn -- as of Unicode 3.0
> <identifier_extend> := Mn, Mc, Nd, Pc, Cf, AND Cn -- as of Unicode 3.0
> <excluded> := Me, No, Zs, Zl, Zp, Cc, Pd, Ps, Pe, Pi, Pf, Po, Sm, Sc, Sk, So -- as of Unicode 3.0
> 4. Restricted Open Season.
> The Unicode consortium divides up the unassigned space in more detail, and specifies that <excluded> characters and <identifier_extend> characters can only be allocated in the future within certain blocks. This has the effect of dividing Cn into subcategories: Cni, Cne, and Cnx. While characters will change from each of these to other properties over time as characters become assigned, the three relevant categories will remain unchanged.
> Since future allocations will not disturb the identifier syntax, identifiers are thus fixed for all time. Disadvantage: the consortium as a whole has resisted such assignment of blocks for unassigned characters in the past (except Cf).
> <identifier_start> := Lu, Ll, Lt, Lm, Lo, Nl, AND Cni
> <identifier_extend> := Mn, Mc, Nd, Pc, Cf, AND Cne
> <excluded> := Me, No, Zs, Zl, Zp, Cc, Pd, Ps, Pe, Pi, Pf, Po, Sm, Sc, Sk, So, AND Cnx
> Mark
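
For reference, the category lists in options 1 and 3 above can be
sketched in Python roughly as follows. This is only an illustration
under stated assumptions: unicodedata reports categories for whatever
Unicode version ships with the Python interpreter, not Unicode 2.0 or
3.0, and the function names classify and is_identifier are made up
here. Categories that appear in none of the quoted lists are treated
as excluded.

```python
import unicodedata

# General categories taken from the quoted lists.
ID_START = {"Lu", "Ll", "Lt", "Lm", "Lo", "Nl"}
ID_EXTEND = {"Mn", "Mc", "Nd", "Pc", "Cf"}


def classify(ch, open_season=False):
    """Return 'start', 'extend' or 'excluded' for one character."""
    cat = unicodedata.category(ch)
    if cat == "Cn" and open_season:
        # Option 3: unassigned characters are allowed in both positions,
        # so 'start' (valid anywhere in an identifier) covers both.
        return "start"
    if cat in ID_START:
        return "start"
    if cat in ID_EXTEND:
        return "extend"
    return "excluded"


def is_identifier(s, open_season=False):
    """True if s begins with a start character and has no excluded ones."""
    if not s:
        return False
    kinds = [classify(c, open_season) for c in s]
    return kinds[0] == "start" and all(k != "excluded" for k in kinds)
```

With open_season left False, a name containing a code point that the
interpreter's Unicode data still lists as unassigned is rejected; with
open_season=True the same name is accepted, which is the trade-off
option 3 describes.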

Progress is a proud sponsor of the 16th International Unicode Conference
March 27-30, 2000 in Amsterdam, Holland
See our panel on Open Source Approaches to Unicode Libraries
Tex Texin                     Director, International Products
Progress Software Corp.       +1-781-280-4271
14 Oak Park                   +1-781-280-4655 (Fax)
Bedford, MA 01730  USA
The #1 Embedded Database
JMS Compliant Messaging- Best Middleware Award
Leading provider in the ASP marketplace

Progress Globalization Program
Spanish Proverb: Don't speak unless you can improve on the silence.
Tex's Proverb: Don't email unless you can improve on the screen saver.
