Re: Rationale wanted for Unicode identifier rules

From: Mark Davis (markdavis@ispchannel.com)
Date: Sun Mar 05 2000 - 21:54:53 EST

Next message: Doug Ewell: "Re: Rationale for U+10FFFF?"
Previous message: Timothy Partridge: "Re: Durability of ISO/IEC 10646-1:2000"
In reply to: Tex Texin: "Re: Rationale wanted for Unicode identifier rules"
Next in thread: Tex Texin: "Re: Rationale wanted for Unicode identifier rules"
Reply: Tex Texin: "Re: Rationale wanted for Unicode identifier rules"

Because XML files will be distributed broadly, as broadly as HTML, in effect nobody can use the new characters, because you can't be sure that parsers on the other side will accept them. This is different from the compiler case, where you have much more control over the versioning. Since this is data, it can potentially end up anywhere.

Your suggestion of adding the additional identifier characters (categorized into 2 groups) is a possibility. If restricted to the actual characters required by the document it might not be too bad. I'd have to ask the parser people whether this would be easy or hard (it would definitely impact performance, I suspect).
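
A rough sketch of what that might look like on the parser side, assuming the document carries two lists of extra characters (start and extend) and the parser applies them only while processing that one file. Everything here is hypothetical: the class and method names are invented for illustration, and Python's bundled unicodedata tables stand in for whatever Unicode version the parser was built against. The check against already-assigned characters corresponds to the validation Tex raises below.

```python
import unicodedata
from contextlib import contextmanager

# General categories from the identifier syntax quoted below.
START_CATEGORIES = {"Lu", "Ll", "Lt", "Lm", "Lo", "Nl"}
EXTEND_CATEGORIES = {"Mn", "Mc", "Nd", "Pc", "Cf"}


class IdentifierTable:
    """Built-in identifier classes plus a per-document overlay (hypothetical)."""

    def __init__(self):
        self.extra_start = set()
        self.extra_extend = set()

    def is_start(self, ch):
        return unicodedata.category(ch) in START_CATEGORIES or ch in self.extra_start

    def is_extend(self, ch):
        return unicodedata.category(ch) in EXTEND_CATEGORIES or ch in self.extra_extend

    def is_identifier(self, text):
        # First character must be a start character; the rest may be either class.
        if not text or not self.is_start(text[0]):
            return False
        return all(self.is_start(c) or self.is_extend(c) for c in text[1:])

    @contextmanager
    def document_overlay(self, start=(), extend=()):
        # Reject any attempt to redefine a character the parser already knows.
        for ch in list(start) + list(extend):
            if unicodedata.category(ch) != "Cn":
                raise ValueError("U+%04X is already assigned" % ord(ch))
        saved = (self.extra_start, self.extra_extend)
        self.extra_start, self.extra_extend = set(start), set(extend)
        try:
            yield self
        finally:
            # Return to the original state for the next file.
            self.extra_start, self.extra_extend = saved
```

The extra set lookups on every character are where I would expect the performance cost to show up, though how much it matters would depend on the parser.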

Mark

Tex Texin wrote:

> On successive upgrades, #2-
>
> No language can guarantee downward compatibility (without
> remaining static itself).
>
> As long as character properties, once defined, do not change, then
> at least XML can have upward compatibility, so something defined
> with XML 1.0 should continue to work with successive upgrades.
>
> I don't see why it would be unacceptable to have the scenario
> where XML 1.0 files continue to work, but something that
> takes advantage of functionality in a later version requires
> that version or later. It's pretty much the way of the world.
>
> There is perhaps one other scenario-
> If there is a way for an XML file to optionally carry a definition of
> character properties with it, then it can be downward compatible.
> Of course you wouldn't want to define all characters, maybe just those
> defined later than some version. Then it would be able to be parsed
> by versions down to whatever version was desired.
>
> (Perhaps you would want to validate that no character already defined
> in the parser was having its properties changed or overridden.
> I am not sure if this is needed or not yet.)
>
> It would mean a parser would have to be able to append to its
> character property table for the duration of the processing of the
> file and then return to its original state for the next file.
>
> tex
>
> Mark Davis wrote:
> >
> > In general, I agree with the discussion here: identifiers should be chosen on the basis of character properties. As new characters are assigned, they are given appropriate properties, and the class of possible identifiers grows. There are, however, difficulties with this approach in certain contexts.
> >
> > Take, for example, XML identifiers. The difference in this case is that the identifiers occur in structured data, not program text. This data will live for years. The conformance requirements for XML identifiers are very strict. This is absolutely correct, since it guarantees compatibility around the world. But what this means is that the current, conformant XML parsers cannot accept new Unicode 3.0 letters in identifiers. There are a few main approaches to identifiers in XML, listed below.
> >
> > [One note that is relevant to all of these: while <identifier_extend> includes Cf, these characters should be filtered when composing identifiers, so there are actually 4 relevant categories for parsing identifiers. However, there are reserved blocks (2060-206F and E0000-E1000) now for Cf characters, so they should not present a problem.]
> >
> > 1. Status quo.
> > Never accept characters outside of Unicode 2.0 in identifiers. Downside: new scripts, and additions (e.g. CJK ideographs) to existing scripts are disadvantaged -- forever.
> >
> > <identifier_start> := Lu, Ll, Lt, Lm, Lo, Nl -- as of Unicode 2.0
> > <identifier_extend> := Mn, Mc, Nd, Pc, Cf -- as of Unicode 2.0
> > <excluded> := Me, No, Zs, Zl, Zp, Cc, Pd, Ps, Pe, Pi, Pf, Po, Sm, Sc, Sk, So -- as of Unicode 2.0
> >
> > 2. Successive upgrades.
> > Revise XML with each version of Unicode. This means you will have XML 1.0-compliant parsers, XML 1.1-compliant parsers, etc. Disadvantage: it takes years for compliant parsers to be fully spread across the world. During that time, data interchange between different versions of parsers cannot be guaranteed. I believe this will be unacceptable to the XML community.
> >
> > 3. Open Season.
> > Define identifiers to be *fixed* as of Unicode 3.0, but to also include unassigned characters (Cn) as of that version.
> >
> > Identifiers are thus fixed for all time. They include all new letters that will be defined. Disadvantage: they will also include new punctuation, symbols, etc. defined post-3.0.
> >
> > <identifier_start> := Lu, Ll, Lt, Lm, Lo, Nl, AND Cn -- as of Unicode 3.0
> > <identifier_extend> := Mn, Mc, Nd, Pc, Cf, AND Cn -- as of Unicode 3.0
> > <excluded> := Me, No, Zs, Zl, Zp, Cc, Pd, Ps, Pe, Pi, Pf, Po, Sm, Sc, Sk, So -- as of Unicode 3.0
> >
> > 4. Restricted Open Season.
> > The Unicode consortium divides up the unassigned space in more detail, and specifies that <excluded> characters and <identifier_extend> characters can only be allocated in the future within certain blocks. This has the effect of dividing Cn into subcategories: Cni, Cne, and Cnx. While characters will change from each of these to other properties over time as characters become assigned, the three relevant categories will remain unchanged.
> >
> > Since future allocations will not disturb the identifier syntax, identifiers are thus fixed for all time. Disadvantage: the consortium as a whole has resisted such assignment of blocks for unassigned characters in the past (except Cf).
> >
> > <identifier_start> := Lu, Ll, Lt, Lm, Lo, Nl, AND Cni
> > <identifier_extend> := Mn, Mc, Nd, Pc, Cf, AND Cne
> > <excluded> := Me, No, Zs, Zl, Zp, Cc, Pd, Ps, Pe, Pi, Pf, Po, Sm, Sc, Sk, So, AND Cnx
> >
> > Mark
>
> --
> Progress is a proud sponsor of the 16th International Unicode Conference
> March 27-30, 2000 in Amsterdam, Holland
> http://www.unicode.org/iuc/iuc16/index.html
> See our panel on Open Source Approaches to Unicode Libraries
> http://www.unicode.org/iuc/iuc16/a206.html
> ------------------------------------------------------------------------------------------------
> Tex Texin Director, International Products
>
> Progress Software Corp. +1-781-280-4271
> 14 Oak Park +1-781-280-4655 (Fax)
> Bedford, MA 01730 USA texin@bedford.progress.com
>
> http://www.progress.com The #1 Embedded Database
> http://www.SonicMQ.com JMS Compliant Messaging- Best Middleware
> Award
> http://www.aspconnections.com Leading provider in the ASP marketplace
>
> Progress Globalization Program
> http://www.progress.com/services/partners/globalization/index.htm
> ------------------------------------------------------------------------------------------------
> Spanish Proverb: Don't speak unless you can improve on the silence.
> Tex's Proverb: Don't email unless you can improve on the screen saver.
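
For readers following the thread, the quoted approaches 1, 3 and 4 differ only in how they treat unassigned (Cn) code points, which the sketch below tries to make concrete. It is illustrative only: the function name, the approach labels, and above all the Cni/Cne ranges are invented here (the quoted message proposes such blocks but does not say where they would be), and Python's bundled unicodedata tables stand in for the "as of Unicode 2.0" or "as of Unicode 3.0" snapshots. Categories the quoted lists do not mention, such as Co and Cs, fall through to "excluded" as an assumption.

```python
import unicodedata

START = {"Lu", "Ll", "Lt", "Lm", "Lo", "Nl"}
EXTEND = {"Mn", "Mc", "Nd", "Pc", "Cf"}

# Hypothetical reserved ranges for approach 4; not real Unicode allocations.
CNI_RANGES = [(0x50000, 0x5FFFF)]  # unassigned, reserved for future start characters
CNE_RANGES = [(0x60000, 0x6FFFF)]  # unassigned, reserved for future extend characters


def _in_ranges(cp, ranges):
    return any(lo <= cp <= hi for lo, hi in ranges)


def classify(ch, approach):
    """Return "start", "extend" or "excluded" for one character.

    approach is "status_quo" (1), "open_season" (3) or "restricted" (4).
    """
    cat = unicodedata.category(ch)
    if cat in START:
        return "start"
    if cat in EXTEND:
        return "extend"
    if cat == "Cn":
        if approach == "open_season":
            # Cn is in both classes; "start" also implies it may continue a name.
            return "start"
        if approach == "restricted":
            cp = ord(ch)
            if _in_ranges(cp, CNI_RANGES):
                return "start"
            if _in_ranges(cp, CNE_RANGES):
                return "extend"
            # Everything else unassigned is treated as Cnx.
    return "excluded"
```

Under "status_quo" an unassigned code point is always excluded, under "open_season" it is always accepted, and under "restricted" the answer depends only on which reserved block it falls in, so the result would not change when the code point is later assigned.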


This archive was generated by hypermail 2.1.2 : Tue Jul 10 2001 - 17:20:59 EDT