Re: New Public Review Issue: Proposed Update UTS #18

From: Mike (mike-list@pobox.com)
Date: Thu Oct 04 2007 - 22:44:43 CDT

Next message: Mike: "Re: Proposal for matching negated sets (was Re: New Public Review Issue: Proposed Update UTS #18)"

Previous message: Mark Davis: "Re: Alignment of IANA language subtag registry to ISO 639-3"
In reply to: Asmus Freytag: "Re: New Public Review Issue: Proposed Update UTS #18"
Next in thread: Asmus Freytag: "Re: New Public Review Issue: Proposed Update UTS #18"
Reply: Asmus Freytag: "Re: New Public Review Issue: Proposed Update UTS #18"
Reply: Michael Maxwell: "RE: New Public Review Issue: Proposed Update UTS #18"

>>> In addition, the meaning of ranges in sets like [a-z] should also be
>>> consistent with the collation used...
>>
>> I disagree with this. I think that having [a-z] magically
>> mean all characters in a particular language is asking for
>> trouble. In French, would you say that [a-z] should match
>> C WITH CEDILLA or A + ACUTE?
> Having that kind of support allows regexes to be written that match, say,
> the top half of a list by using [a-k] etc. That's something that you can
> do in English today, but not in any other language. You need to decide
> whether extending regexes to other languages should allow such uses (in
> which case you think of collation elements and sorting order) or not.
>
> Depending on how many accented letters a language uses, writing the
> equivalent expression manually can be both tedious and error-prone.

The reason I think that [a-z] should only match the 26 code points
is that regular expressions are often used to match things like
domain name parts: [a-zA-Z0-9]([a-zA-Z0-9-]*[a-zA-Z0-9])? where
the allowed characters do not change depending on locale.
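
As a minimal sketch of that case, assuming Python's built-in re module
(where a range in a character class is compared by code point and does
not depend on locale), the pattern above can be exercised like this.
The names LABEL and is_label are only for illustration.

```python
import re

# The pattern from the message, compiled unchanged. With Python's re
# module, [a-z] covers exactly the 26 code points U+0061..U+007A.
LABEL = re.compile(r"[a-zA-Z0-9]([a-zA-Z0-9-]*[a-zA-Z0-9])?")


def is_label(text: str) -> bool:
    """Return True if the whole string matches the label pattern."""
    return LABEL.fullmatch(text) is not None


print(is_label("example"))    # plain ASCII label
print(is_label("my-host1"))   # interior hyphen is allowed
print(is_label("-leading"))   # a hyphen may not start a label
print(is_label("caf\u00e9"))  # e with acute lies outside [a-z]
```

If [a-z] followed the collation of the current locale, the last call
could give a different answer on different machines, which is the
problem for this kind of use.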

I agree that having an easy way to say "match any Swedish character",
or some range of the characters, would be useful; maybe this could be
done using something similar to the \p{} syntax for properties? I
don't want to propose anything since I haven't studied it enough yet.
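
Just to make the idea concrete (this is not a proposal): Python's
built-in re module has no \p{} support, so the sketch below spells the
set out by hand, assuming "Swedish character" means the 26 basic
letters plus a-ring, a-diaeresis and o-diaeresis. The names
SWEDISH_WORD and is_swedish_word are made up for the example.

```python
import re

# Hypothetical stand-in for a "Swedish letter" class: a-z and A-Z plus
# U+00E5, U+00E4, U+00F6 and their uppercase forms U+00C5, U+00C4, U+00D6.
SWEDISH_WORD = re.compile(r"[a-zA-Z\u00e5\u00e4\u00f6\u00c5\u00c4\u00d6]+")

# Plain code point range, for comparison.
ASCII_WORD = re.compile(r"[a-zA-Z]+")


def is_swedish_word(text: str) -> bool:
    """Return True if every character is in the explicit Swedish set."""
    return SWEDISH_WORD.fullmatch(text) is not None


word = "sm\u00f6rg\u00e5sbord"
print(is_swedish_word(word))                   # all letters are in the set
print(ASCII_WORD.fullmatch(word) is not None)  # the plain range rejects it
```

Writing the set by hand works for three extra letters, but it shows why
a named shorthand would be nicer for languages with many more.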

>> It's my opinion that ranges inside [] should be simple binary
>> order. If you want to do anything fancier, there should be
>> new syntax for it.
> That, or an option?

I would be ok with it being an option.

> Now, other than for canonical decompositions (and conjoining Jamo), I've
> not seen an example that informs me of why it is useful for a regex
> package to be able to match 'ch' as if it were a single code point. Can
> somebody please present a simple example that shows an important use
> case that can't be realized if regexes are limited to a single character
> (plus *canonical* equivalents)?

I don't know the reason -- I just implemented all the features
required for level 1 and level 2 conformance, and part of level 2
is being able to do this.
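
To show what treating 'ch' as one unit looks like in practice, here is
a rough sketch using Python's built-in re module. It is not the UTS #18
mechanism, only an alternation that tries the digraphs before single
characters; the digraph list (ch, ll, rr) and the names UNIT and units
are assumptions for the example.

```python
import re

# Alternatives are tried left to right, so a digraph is consumed as one
# unit before the single-character fallback "." gets a chance.
UNIT = re.compile(r"ch|ll|rr|.", re.DOTALL)


def units(text: str) -> list:
    """Split text into units, keeping ch, ll and rr together."""
    return UNIT.findall(text)


print(units("chorro"))  # ['ch', 'o', 'rr', 'o']
print(units("llama"))   # ['ll', 'a', 'm', 'a']
print(units("lucha"))   # ['l', 'u', 'ch', 'a']
```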

> After all, the atomic elements for writing would be the 'c' and 'h', it
> is only for the purpose of some other text operations that 'ch' are
> (sometimes) considered a unit.

I used to be fluent in written Spanish, but despite that, I never
considered ch, ll, or rr to be single characters. I think I did
a Spanish crossword once where ch went into a single square.

Mike

This archive was generated by hypermail 2.1.5 : Fri Oct 05 2007 - 00:31:16 CDT