Re: property, character, and sequence name loose matching

From: Kenneth Whistler (kenw@sybase.com)
Date: Tue Mar 16 2010 - 17:57:48 CST

Next message: karl williamson: "Re: property, character, and sequence name loose matching"

Previous message: jandersen@talentex.co.uk: "Re: Århus mayor prefers Aarhus - "believing the ‘Å’ is a hindrance in international communication""
Maybe in reply to: karl williamson: "property, character, and sequence name loose matching"
Next in thread: karl williamson: "Re: property, character, and sequence name loose matching"
Reply: karl williamson: "Re: property, character, and sequence name loose matching"
Reply: karl williamson: "Re: property, character, and sequence name loose matching"

Asmus (responding to Karl Williamson) noted:

> Fine, you've made your point that
>
> /*UAX44-LM2.*/ Ignore case, whitespace, underscore ('_'), and all
> medial hyphens except the hyphen in U+1180 HANGUL JUNGSEONG O-E.
>
> * "zero-width space" is equivalent to "ZERO WIDTH SPACE" or
> "zerowidthspace"
> * "character -a" is /not/ equivalent to "character a"
>
> could be improved to note the interaction between the presence/absence
> of spaces and "medial". (I believe that's actually in the works).

Indeed:

http://www.unicode.org/reports/tr44/tr44-5.html#Matching_Rules

for the Unicode 6.0 proposed update draft. Now is everybody's
chance to comment if anything about that clarification is still
problematical.

> > As an aside, it has been my experience that ignoring all white space
> > usually leads to unintended negative consequences. The 1966 ANSI
> > Fortran standard suffered from this (I don't know about later
> > versions), and it led to problems, with economic consequences. It is
> > a pity that this lesson did not get passed on to later generations. I
> > doubt that Unicode really wants 'S c r i p t' to mean 'Script', but
> > that's what it says. It would have been better in my opinion for it
> > to say that multiple white space is equivalent to a single white space.
>
> That's a good point, even though you misrepresent the intention of
> Unicode. Of interest here is not the folding of multiple spaces into one
> as much as allowing CamelCase version of names (instead of UPPER CASE or
> lower case with spaces).
>
> At the same time, there are some names, esp. character names, where
> users might disagree about where to add spaces. It was felt useful to
> allow the use not only of fewer spaces, but also of more spaces than the
> formal name.

There are examples such as U+003C LESS-THAN SIGN, where one would
want what might be a fairly common spelling, "less than sign",
to match the formal name "LESS-THAN SIGN".

In Hangul jamo letter names like U+112C HANGUL CHOSEONG
KAPYEOUNSSANGPIEUP, the last part is actually four syllables,
KAP YEOUN SSANG PIEUP, and you might not know where somebody
would or would not add spaces -- or hyphens, for that matter.

No one would *really* know where they should put hyphens or
spaces in U+238F OPEN-CIRCUIT-OUTPUT H-TYPE SYMBOL without looking
it up in the names list. ;-)

U+269C FLEUR-DE-LIS uses the *English* spelling which in most
dictionaries shows hyphens, but a French speaker would be more
likely to use "fleur de lis" without hyphens, since that is
the French spelling.

The point of a loose matching rule for character names like
this is to capture reasonable expectations about what people
might want to do in contexts like identifier, label, or
presentation, and still successfully match the
intended character.
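
A minimal sketch of how the character name rule (UAX44-LM2, as quoted
at the top of this message) could be applied to the examples above.
The function name loose_name_key and the reading of "medial" as
"a hyphen with a letter or digit immediately on both sides" are
assumptions made for illustration, not text from UAX #44:

```python
import re

# Assumption: a "medial" hyphen has a letter or digit directly on each side.
_MEDIAL_HYPHEN = re.compile(r"(?<=[A-Za-z0-9])-(?=[A-Za-z0-9])")
_IGNORABLE = re.compile(r"[\s_]+")  # whitespace and underscore


def loose_name_key(name: str) -> str:
    """Return a comparison key for a character name (hypothetical helper)."""
    # Drop medial hyphens first, while the spaces still show which
    # hyphens are medial ("character -a" keeps its hyphen).
    without_medial = _MEDIAL_HYPHEN.sub("", name)
    key = _IGNORABLE.sub("", without_medial).upper()
    # The one exception named in the rule: U+1180 HANGUL JUNGSEONG O-E
    # must stay distinct from HANGUL JUNGSEONG OE.
    if key == "HANGULJUNGSEONGOE":
        squeezed = _IGNORABLE.sub("", name).upper()
        if squeezed.endswith("O-E"):
            return "HANGULJUNGSEONGO-E"
    return key


def names_match(a: str, b: str) -> bool:
    return loose_name_key(a) == loose_name_key(b)
```

With this sketch, names_match("less than sign", "LESS-THAN SIGN") and
names_match("fleur de lis", "FLEUR-DE-LIS") are both true, while
names_match("character -a", "character a") is false.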

And the standardization committees (UTC and WG2) are aware of
the loose matching rule for character names, and check against
it when creating new character names, so as not to introduce
character names that would be ambiguous under that loose
matching rule.
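
The kind of check described here can be sketched as follows: group
names by their loose key and report any key shared by more than one
name. This is only an illustration under the same assumptions about
"medial" hyphens as in the rule quoted above; loose_key and
find_collisions are hypothetical names, and nothing here is claimed
to be the tooling the committees actually use:

```python
import re
from collections import defaultdict

_MEDIAL_HYPHEN = re.compile(r"(?<=[A-Za-z0-9])-(?=[A-Za-z0-9])")
_IGNORABLE = re.compile(r"[\s_]+")


def loose_key(name: str) -> str:
    """Loose key for a character name (hypothetical helper)."""
    key = _IGNORABLE.sub("", _MEDIAL_HYPHEN.sub("", name)).upper()
    # Keep U+1180 HANGUL JUNGSEONG O-E distinct, as the rule requires.
    if key == "HANGULJUNGSEONGOE" and _IGNORABLE.sub("", name).upper().endswith("O-E"):
        return "HANGULJUNGSEONGO-E"
    return key


def find_collisions(names):
    """Map each loose key shared by two or more names to those names."""
    groups = defaultdict(list)
    for name in names:
        groups[loose_key(name)].append(name)
    return {key: group for key, group in groups.items() if len(group) > 1}


existing = [
    "TIBETAN LETTER A",
    "TIBETAN LETTER -A",
    "HANGUL JUNGSEONG OE",
    "HANGUL JUNGSEONG O-E",
    "LESS-THAN SIGN",
]
# "LESS THAN SIGN" is a made-up proposal, used only to show a collision.
proposed = existing + ["LESS THAN SIGN"]
```

Here find_collisions(existing) finds nothing, while
find_collisions(proposed) reports that the made-up name collides with
LESS-THAN SIGN.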

> > This is a false analogy because Unicode has never said that 'S' is to be
> > ignored in loose matching. Unicode still says (in TR18) that all
> > hyphens (except in 3 cases) are to be ignored. If hyphens can be
> > significant parts of character names, Unicode should never have said
> > they effectively aren't.
>
> UTS 18 is formally a different standard than the Unicode Standard (TUS)
> (which incorporates UAX#44).
> In this case, you are correct, UTS#18 is in conflict with UAX#44 (and
> therefore TUS). The three cases may have been the only cases where
> hyphens resulted in a distinct name at the time UTS#18 was drafted, but
> it's clear that this approach is not robust, as long as UTC can add
> additional names under the slightly different rules of UAX#44.
>
> That should result in a correction/corrigendum for UTS#18.

Sorry, but I'm not seeing it.

The conformance requirements for claiming a level of
conformance in UTS #18, RL 1.5 Simple Loose Matches,
and RL 2.4 Default Loose Matches, have only to do with
case-insensitive matching for generic text, and do not
involve ignoring of whitespace, hyphens or underscores.

The only mention of loose matching of the type that we
are talking about is in Section 1.2 Properties, where it is
referring specifically to *property* names and values. And
there it is couched as a recommendation -- not a conformance
requirement:

"It is strongly recommended that both [long and short] property
names be recognized, and that loose matching of property names
be used, whereby the case distinctions, whitespace, hyphens,
and underbar are ignored."

And as Asmus pointed out in an earlier note in this thread,
property names (or more exactly property aliases and
property value aliases) follow a different pattern than
character names. They are unambiguously interpretable if
you ignore all "case distinctions, whitespace, hyphens,
and underbar", because there are no funky edge cases
involving medial hyphens for those. In fact there are no
space characters whatsoever in any of the normative property
aliases or property value aliases in the Unicode Character
Database. And if somebody sticks a space (or spaces)
in a regular expression for something like \p{General Category:Lm}
instead of using \p{gc:Lm}, well, then the kindly (and
reasonable) thing for the regex engine to do would be
to ignore that space, as it is more likely to get the
expected result than it would by throwing a syntax exception.

The applicable loose matching rule in this case is not
the character names loose matching rule (UAX44-LM2), but
rather the symbolic values loose matching rule (UAX44-LM3).
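
A rough sketch of what that could look like for property names,
covering only the four things named in the UTS #18 recommendation
quoted above (case, whitespace, hyphens, underbar). The alias table
is a tiny hand-written sample rather than the real property alias
data, and loose_symbol_key and resolve_property are hypothetical
names:

```python
import re

_IGNORED = re.compile(r"[\s_\-]+")  # whitespace, underscore, hyphen


def loose_symbol_key(text: str) -> str:
    """Comparison key for a property name or alias (hypothetical helper)."""
    return _IGNORED.sub("", text).lower()


# Hand-written sample only; a real engine would load the full alias data.
PROPERTY_ALIASES = {
    "gc": "General_Category",
    "General_Category": "General_Category",
    "sc": "Script",
    "Script": "Script",
}

_LOOKUP = {loose_symbol_key(alias): canon for alias, canon in PROPERTY_ALIASES.items()}


def resolve_property(name: str):
    """Return the canonical property name, or None if nothing matches."""
    return _LOOKUP.get(loose_symbol_key(name))
```

Under this sketch resolve_property("General Category"),
resolve_property("general-category") and resolve_property("gc") all
return "General_Category", and "S c r i p t" resolves to "Script",
which is the behavior Karl was questioning.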

Now granted this hasn't been spelled out explicitly in
the standard all that long -- the elaborations in UAX #44
are of fairly recent provenance. But this was nevertheless
the clear intent of the property alias files all along,
since they first were published as part of the UCD.

--Ken

This archive was generated by hypermail 2.1.5 : Tue Mar 16 2010 - 18:03:21 CST