Re: PRI #202: Extensions to NameAliases.txt for Unicode 6.1.0

From: Philippe Verdy <verdy_p_at_wanadoo.fr>
Date: Wed, 31 Aug 2011 02:27:29 +0200

After looking at the actual reason why this PRI #202 emerged (a
request from the Perl authors), laid out in UTC document number
"L2/2011/11281", I now think that *all* of these aliases were
unnecessary.

The bug emerged in Perl only because a character named "BELL" was
added, conflicting with the *custom* (non-standardized) alias that
Perl used to reference the control.
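
To make the conflict concrete, here is a small Python sketch (Python
rather than Perl, purely for illustration). It assumes a Python build
whose Unicode database is version 6.0 or later; the placeholder string
"<no formal name>" is my own and not part of any standard.

```python
import unicodedata

# U+0007 is a C0 control: it has no formal character name, so we pass
# a placeholder default instead of letting name() raise ValueError.
control_name = unicodedata.name("\u0007", "<no formal name>")

# Since Unicode 6.0, the formal name BELL belongs to U+1F514, so a
# lookup by that name no longer reaches the control character.
bell = unicodedata.lookup("BELL")

print(control_name)
print("U+%04X" % ord(bell))
```

Any tool that kept "BELL" as its own private alias for U+0007 in the
same lookup table now has two candidates for a single name.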

I think the problem has been approached from the wrong end. Really,
the UCS namespace of characters has *never* been designed to allow any
custom alias. In other words, what Perl did, by adding those custom
aliases, was clearly not conforming to the standard.

Perl should not have reused the same property to reference both the
standard names (or aliases) and its own custom aliases (even if those
aliases are needed and widely known).

Now comes an interesting question: how can an application define its
own custom property types, with a guarantee of never conflicting with
existing or future property types? And how can those custom property
types be used in regular expressions?

For now there's no private-use space for naming those custom property
types (yes, I'm speaking about the property *types*, not the
property *values* such as, here, the standard character names, or the
standard character aliases, or the names and aliases of standard
character sequences, whose values all belong to the same standard UCD
namespace).

Is there such a naming scheme for these custom properties (for example
reserving property types starting with "x" or "q" or "_"), such that we
could write in a regexp: "[\p{_customtype=anyvalue}]", and where the
"anyvalue" would be guaranteed never to conflict with the other
definition domains used by standard properties?
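
As a rough illustration of how an application could support such a
syntax, here is a minimal Python sketch. Everything in it is
hypothetical: the "_" prefix convention, the CUSTOM_PROPERTIES table,
the property type "_perlalias" and the function name are made up for
the example. Python's re module has no \p{...} support, so the sketch
simply expands the custom property into explicit \UXXXXXXXX escapes
before compiling the pattern.

```python
import re

# Hypothetical application-private table:
# property type -> property value -> set of characters.
CUSTOM_PROPERTIES = {
    "_perlalias": {
        "BELL": {"\u0007"},  # private alias for the C0 control
    },
}

# Matches \p{_type=value}; only types starting with "_" are handled.
_CUSTOM_PROP = re.compile(r"\\p\{(_[A-Za-z0-9]+)=([^}]+)\}")

def expand_custom_properties(pattern):
    """Replace each \\p{_type=value} with explicit \\UXXXXXXXX escapes."""
    def repl(match):
        prop_type, value = match.group(1), match.group(2)
        chars = CUSTOM_PROPERTIES[prop_type][value]
        return "".join("\\U%08x" % ord(c) for c in sorted(chars))
    return _CUSTOM_PROP.sub(repl, pattern)

expanded = expand_custom_properties(r"[\p{_perlalias=BELL}]")
```

Because the private value "BELL" lives under its own property type,
the compiled class matches U+0007 only, and the standard name BELL
(U+1F514) is left untouched.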

With such a scheme, it could then be possible for applications, or for
standards other than Unicode, to define their own sets of properties. I
suggest that custom property types needed for implementing other
standards use a separate prefix (such as "x-", for example "x-iso-",
"x-ms-", etc.) that could be registered somewhere in the UCD. The
registration would serve only to avoid conflicts between sources of
these custom properties, and would carry a time limit, to be renewed
by its promoter, just to avoid permanently filling this registry with
custom properties that no longer have any use. Custom property types
specific to applications could instead use the "_" prefix (meaning
that the type is application-dependent), without any prior formal
registration (just like with PUAs).
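
The split between the three namespaces could be checked mechanically.
The following Python sketch only illustrates the proposal above; the
function name, the returned labels and the sample property names are
all hypothetical, and no such registry exists in the UCD today.

```python
def classify_property_type(name):
    """Return (namespace, registrant) for a property type name.

    "_..."            -> application-private, no registration
    "x-<owner>-<id>"  -> external standard, <owner> is the registrant
    anything else     -> reserved for standard Unicode properties
    """
    if name.startswith("_"):
        return ("application", None)
    if name.startswith("x-"):
        parts = name.split("-", 2)
        if len(parts) == 3 and parts[1] and parts[2]:
            return ("external", parts[1])
        raise ValueError("expected x-<owner>-<id>, got %r" % name)
    return ("standard", None)

print(classify_property_type("_perlalias"))
print(classify_property_type("x-iso-charname"))
print(classify_property_type("Name_Alias"))
```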

Then the Unicode standard would not use these prefixes, but could
standardize all other properties with more freedom, and with a
guarantee of long-term stability (even for standard Unicode
properties that become deprecated in favor of newer standard
properties that would be strongly preferred).

===

Anyway, I recognize that it will still be helpful to have at least one
standard alias for those C0 and C1 <control> characters in the UCS
namespace. But there's no need to create multiple ones, unless there
are naming errors in those names that create severe confusion when
identifying the correct characters. For me, one and only one standard
alias is needed for those controls (and probably those names will be
more helpful if they are in their abbreviated form).

What this means is that even the common abbreviations like "ZWNJ" or
"BOM" are not needed (except only one unabbreviated alias for C0 and
C1 controls). If you really want to support them by some standardized
means, put these names or abbreviations within another namespace,
accessible through another standard property type. This will avoid
unnecessary pollution of the UCS namespace.

The UCS namespace was never designed to include abbreviated names (we
all know that abbreviations are very often ambiguous, so they are very
likely to conflict with each other, with many meanings). What this
means is that the only alias worth adding to the character aliases is
a single unabbreviated name for each control.

-- Philippe.
Received on Tue Aug 30 2011 - 19:31:11 CDT
