Re: Corrigendum #9

From: Mark Davis ☕️ <mark_at_macchiato.com>
Date: Sat, 31 May 2014 22:02:57 +0200

A few quick items. (I admit to only skimming your response, Philippe; there
is only so much time in the day.)

Any discussion of changing non-characters is really pointless. See
http://www.unicode.org/policies/property_value_stability_table.html

As to breaking up the block, that is not forbidden: but one would have to
give pretty compelling arguments that the benefits would outweigh any
likely problems, especially since we already don't recommend the use of the
block property in regexes.
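
To make the concern about splitting concrete, here is a minimal Python sketch. It assumes the block that contains the noncharacters U+FDD0..U+FDEF, Arabic Presentation Forms-A (U+FB50..U+FDFF), and a purely hypothetical split around them; `BLOCK_SPLIT` and `in_ranges` are illustrative names, not part of any Unicode data file or library.

```python
# The block as currently defined: one contiguous range.
BLOCK_TODAY = [(0xFB50, 0xFDFF)]

# Hypothetical split that carves out the noncharacters U+FDD0..U+FDEF.
# This is an assumption for illustration, not an actual Unicode change.
BLOCK_SPLIT = [(0xFB50, 0xFDCF), (0xFDF0, 0xFDFF)]


def in_ranges(cp, ranges):
    """Return True if code point cp falls in any (low, high) inclusive range."""
    return any(low <= cp <= high for low, high in ranges)


# Code points whose block membership would differ between the two definitions.
changed = [
    cp
    for cp in range(0xFB50, 0xFE00)
    if in_ranges(cp, BLOCK_TODAY) != in_ranges(cp, BLOCK_SPLIT)
]

print(f"{len(changed)} code points change: U+{changed[0]:04X}..U+{changed[-1]:04X}")
```

Any expression built on the block property would give a different answer for exactly those code points after such a split.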

> And regular expressions trying to use character properties have many more
> caveats to handle (the most serious being with canonical equivalences and
> discontinuous matches or partial matches).

The UTC, after quite a bit of work, concluded that it was not feasible with
today's regex engines to handle normalization automatically, instead
recommending the approach in
http://www.unicode.org/reports/tr18/#Canonical_Equivalents
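
As a rough illustration of the general idea (put the text into a defined normalization form and write the pattern against that form), here is a Python sketch. The helper name `nfd_search` is hypothetical, it handles only literal patterns, and it is not meant as the full procedure described in the report.

```python
import re
import unicodedata


def nfd_search(literal, text):
    """Search for a literal string after normalizing both sides to NFD."""
    norm_text = unicodedata.normalize("NFD", text)
    norm_pattern = re.escape(unicodedata.normalize("NFD", literal))
    return re.search(norm_pattern, norm_text)


precomposed = "caf\u00e9"   # 'é' as a single code point, U+00E9
decomposed = "cafe\u0301"   # 'e' followed by U+0301 COMBINING ACUTE ACCENT

# A plain code point search misses the canonically equivalent spelling.
raw_match = re.search(re.escape(precomposed), decomposed)

# Normalizing both sides first makes the two spellings compare equal.
normalized_match = nfd_search(precomposed, decomposed)
```

Note that match offsets returned this way refer to the normalized text, not to the original string.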

> Regexps are still a very experimental proposal; they are still very
> difficult to make interoperable except in a small set of tested cases

I have no idea where this is coming from. Regexes using Unicode properties
are in widespread and successful use. It is not that hard to make them
interoperable (as long as both implementations are using the same version
of Unicode).
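
One way to make that version dependence explicit, sketched in Python: record which version of the Unicode Character Database the property lookups came from. The function name `letters_in` is hypothetical; `unicodedata.unidata_version` and `unicodedata.category` are the standard library calls.

```python
import unicodedata


def letters_in(text):
    """Return the characters whose General_Category starts with 'L' (letters)."""
    return [ch for ch in text if unicodedata.category(ch).startswith("L")]


# The result depends on the Unicode version bundled with this Python build,
# so keep the version next to anything derived from property lookups.
result = {
    "unicode_version": unicodedata.unidata_version,
    "letters": letters_in("a1\u03b2!"),
}
print(result)
```

Two implementations that report the same version here should agree on the letter set; a mismatch in version is the first thing to check when they do not.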

Mark <https://google.com/+MarkDavis>

 *— Il meglio è l’inimico del bene —*

On Sat, May 31, 2014 at 9:36 PM, Philippe Verdy <verdy_p_at_wanadoo.fr> wrote:

> Maybe, but I really doubt that a regular expression needing this
> property would be severely broken if that property were corrected.
> There are many other properties that are more useful (and much more used)
> whose associated set of codepoints changes regularly across versions.
>
> I don't see any specific interest in maintaining non-characters in that
> block, as it effectively reduces the reusability of this property.
> And in fact it would be highly preferable to stop stating that these
> code points in ArabicPresentationForm are non-characters, and to treat them
> like C1 controls or PUA instead (because they will never be reassigned to
> something more useful). Making them PUA would not radically change the fact
> that these characters are not recommended, but we would no longer have to
> bother checking whether they are valid. They remain there only as a legacy
> of old, outdated versions of Unicode, for a mysterious need that I've not
> clearly identified.
>
> Let's assume we change them into PUA; some applications will start
> accepting them while others won't. Not a problem, given that they are
> already not interoperable.
>
> And regular expressions trying to use character properties have many more
> caveats to handle. The most serious involve canonical equivalences and
> discontinuous or partial matches, when searches focus only on exact sets
> of code points instead of sets of canonically equivalent texts. Another
> complication comes from collation and its variable strength, which matches
> more or fewer parts of the text spanning ignorable collation elements,
> i.e., possibly also discontinuous runs of ignorable code points, if we want
> consistent results independent of the normalization form. More complicated
> still is how to handle "partial matches", such as a combining character
> within a precomposed character that is canonically equivalent to a string
> where this combining character appears.
>
> And even more tricky is how to handle substitution with regexps, for
> example when performing a search at the primary collation level, ignoring
> letter case, while wanting to replace base letters but preserve case in the
> substituted string. This requires a specific lookup of characters using
> properties **not** specified in the UCD but in the collation tailoring
> data. Then there is the question of how to ensure that the result of the
> substitution in the plain-text source remains valid text that creates no
> new unexpected canonical equivalences, that it does not break basic
> orthographic properties such as syllabic structures in a specific
> language+script pair, and that it does not produce unexpected collation
> equivalents at the same collation strength, which could later cause
> unexpected never-ending loops of substitutions, for example on large
> websites with bots performing text corrections.
>
> Regexps are still a very experimental proposal; they are still very
> difficult to make interoperable except in a small set of tested cases, and
> for this reason I really doubt that the "character encoding block"
> property is very productive for now with regexps (and notably not with this
> "compatibility" block, whose characters will remain used in isolation,
> independently of their context, if they are still used in rare cases).
>
> I see little value in keeping this old complication in this block, only
> more interoperability problems for implementations. So these
> non-characters should be treated mostly like PUA, except that they have a
> few more properties: direction=RTL, script=Arabic, and starters working in
> isolation for the Arabic joining type (these properties can help limit
> their generic reusability like regular PUAs, but at least all other
> processes, and notably generic validators, won't have to bother about them).
>
> 2014-05-31 18:17 GMT+02:00 Asmus Freytag <asmusf_at_ix.netcom.com>:
>
> On 5/31/2014 4:09 AM, Philippe Verdy wrote:
>>
>> 2014-05-30 20:49 GMT+02:00 Asmus Freytag <asmusf_at_ix.netcom.com>:
>>
>>> This might have been possible at the time these were added, but now it
>>> is probably not feasible. One of the reasons is that block names are
>>> exposed (for better or for worse) as character properties and as such are
>>> also exposed in regular expressions. While not recommended, it would be
>>> really bad if the expression with pseudo-code
>>> "IsInArabicPresentationFormB(x)" were to fail, because we split the block
>>> into three (with the middle one being the noncharacters).
>>>
>>
>> If you think about pseudocode testing for properties, then nothing
>> forbids the test IsInArabicPresentationFormB(x) from checking two ranges
>> instead of just one.
>>
>> Besides the point.
>>
>> The issue is that the result of evaluation of an expression would change
>> over time.
>>
>> A./
>>
>>
>
> _______________________________________________
> Unicode mailing list
> Unicode_at_unicode.org
> http://unicode.org/mailman/listinfo/unicode
>
>

Received on Sat May 31 2014 - 15:04:40 CDT
