Re: Corrigendum #9

From: Philippe Verdy <>
Date: Sat, 31 May 2014 21:36:37 +0200

Maybe, but there's real doubt that a regular expression needing this
property would be severely broken if that property were corrected.
There are many other properties that are more useful (and much more used)
whose associated sets of code points change regularly across versions.

I don't see any specific interest in maintaining noncharacters in that
block, as it effectively reduces the reusability of this property.
In fact it would be highly preferable to stop designating these code points
in the Arabic Presentation Forms block as noncharacters, and to treat them
like C1 controls or PUA instead (because they will never be reassigned to
something more useful). Making them PUA would not radically change the fact
that these characters are not recommended, but we would no longer have to
bother checking whether they are valid. They remain there only as a legacy
of old, outdated versions of Unicode, for a mysterious need that I've not
clearly identified.

Let's assume we change them into PUA: some applications will start
accepting them while others won't. That is not a problem, given that they
are already not interoperable.

And regular expressions that use character properties have many more
caveats to handle. The most serious involve canonical equivalences and
discontinuous or partial matches, when searches focus only on exact sets of
code points instead of sets of canonically equivalent texts. Another
complication comes from collation and its variable strength, which matches
more or less of the text spanning ignorable collation elements, i.e.
possibly also discontinuous runs of ignorable code points, if we want
consistent results independent of the normalization form. More complicated
still is how to handle "partial matches", such as a combining character
within a precomposed character that is canonically equivalent to a string
in which this combining character appears.

And even more tricky is how to handle substitution with regexps, for
example when performing a search at the primary collation level, ignoring
letter case, while wanting to replace base letters but preserve case in
the substituted string. This requires a specific lookup of characters using
properties **not** specified in the UCD but in the collation tailoring
data. Then there is the question of how to ensure that the result of the
substitution in the plain-text source remains valid text that does not
create new unexpected canonical equivalences, that it does not break basic
orthographic properties such as syllabic structures in a specific
language+script pair, and that it does not produce unexpected collation
equivalents at the same collation strength, which could later cause
unexpected never-ending loops of substitutions, for example on large
websites with bots operating text corrections.
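
As a small illustration of the canonical-equivalence caveat, here is a
minimal Python sketch. The helper names codepoint_search and
canonical_search are hypothetical, and normalizing both sides to NFD is
only one naive approach, not what any particular regex engine does. It
shows a raw code point search missing a canonically equivalent string, and
a normalized search reporting a "partial match" on a combining mark that
exists only inside a precomposed character.

```python
import re
import unicodedata


def nfd(text: str) -> str:
    """Return the canonical decomposition (NFD) of text."""
    return unicodedata.normalize("NFD", text)


def codepoint_search(needle: str, haystack: str):
    """Literal search on raw code points, blind to canonical equivalence."""
    return re.search(re.escape(needle), haystack)


def canonical_search(needle: str, haystack: str):
    """Naive equivalence-aware search: normalize both sides to NFD first.

    Offsets in the returned match refer to the NFD form of haystack,
    not to the original string.
    """
    return re.search(re.escape(nfd(needle)), nfd(haystack))


precomposed = "caf\u00e9"   # ends with U+00E9 LATIN SMALL LETTER E WITH ACUTE
decomposed = "cafe\u0301"   # ends with "e" + U+0301 COMBINING ACUTE ACCENT

# Canonically equivalent texts, different code points.
raw_hit = codepoint_search(decomposed, precomposed)
canonical_hit = canonical_search(decomposed, precomposed)
print("raw search found a match:", raw_hit is not None)
print("normalized search found a match:", canonical_hit is not None)

# Partial match: the accent is found only after decomposition, at an
# offset that is valid in the NFD text but not in the original string.
accent = canonical_search("\u0301", precomposed)
print("accent offset in NFD text:", accent.start())
print("length of original text:", len(precomposed))
```

A real implementation would also have to map such offsets back to the
original text, which is exactly where the partial-match question becomes
difficult.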

Regexps are still a very experimental proposal; they are still very
difficult to make interoperable except in a small set of tested cases, and
for this reason I really doubt that the "character encoding block"
property is very productive for now with regexps (and notably not with this
"compatibility" block, whose characters will remain used in isolation,
independently of their context, if they are still used in rare cases).

I see little value in keeping this old complication in this block, just
more interoperability problems for implementations. So these noncharacters
should be treated mostly like PUA, except that they have a few more
properties: direction=RTL, script=Arabic, and starters working in
isolation for the Arabic joining type (these properties may limit their
generic reusability compared with regular PUAs, but at least all other
processes, and notably generic validators, won't have to bother about them).
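
To make the validator point concrete, here is a minimal Python sketch of
the checks involved. All function names are hypothetical. The noncharacter
set (U+FDD0..U+FDEF plus the last two code points of every plane) is fixed
by the standard; the block bounds U+FB50..U+FDFF are assumed to be those of
Arabic Presentation Forms-A, the block in which U+FDD0..U+FDEF are located.
The sketch also shows that a block test can exclude the noncharacter range,
i.e. effectively check two ranges instead of one.

```python
import unicodedata

# Assumed block bounds (Arabic Presentation Forms-A in the Unicode block list).
BLOCK_START, BLOCK_END = 0xFB50, 0xFDFF


def is_noncharacter(cp: int) -> bool:
    """True for U+FDD0..U+FDEF and for U+nFFFE / U+nFFFF in every plane."""
    return 0xFDD0 <= cp <= 0xFDEF or (cp & 0xFFFE) == 0xFFFE


def is_private_use(cp: int) -> bool:
    """True for code points whose General_Category is Co (private use)."""
    return unicodedata.category(chr(cp)) == "Co"


def in_block(cp: int) -> bool:
    """Single-range block test, noncharacters included."""
    return BLOCK_START <= cp <= BLOCK_END


def in_block_excluding_noncharacters(cp: int) -> bool:
    """Block test that skips the noncharacters, i.e. two sub-ranges."""
    return in_block(cp) and not is_noncharacter(cp)


def validate(text: str) -> list:
    """Return offsets of noncharacters; private-use code points pass."""
    return [i for i, ch in enumerate(text) if is_noncharacter(ord(ch))]


# A generic validator flags the noncharacter but lets the PUA code point by.
print(validate("a\ufdd0b\ue000"))
```

If the 32 code points were private use instead, the validate step would
simply have nothing special to check in this block.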

2014-05-31 18:17 GMT+02:00 Asmus Freytag <>:

> On 5/31/2014 4:09 AM, Philippe Verdy wrote:
> 2014-05-30 20:49 GMT+02:00 Asmus Freytag <>:
>> This might have been possible at the time these were added, but now it is
>> probably not feasible. One of the reasons is that block names are exposed
>> (for better or for worse) as character properties and as such are also
>> exposed in regular expressions. While not recommended, it would be really
>> bad if the expression with pseudo-code "IsInArabicPresentationFormB(x)"
>> were to fail, because we split the block into three (with the middle one
>> being the noncharacters).
> If you think about pseudocode testing for properties then nothing
> forbids the test IsInArabicPresentationFormB(x) to check two ranges instead
> of just one.
> Besides the point.
> The issue is that the result of evaluation of an expression would change
> over time.
> A./

Unicode mailing list
Received on Sat May 31 2014 - 14:37:55 CDT
