Re: UAX 29 questions from Karl Williamson on 2015-01-29 (Unicode Mail List Archive)

From: Karl Williamson <public_at_khwilliamson.com>
Date: Thu, 29 Jan 2015 11:52:30 -0700

On 01/25/2015 05:14 AM, Philippe Verdy wrote:
> This is not a contradiction.

At the very least it is too sloppy for a standard. Once there is a
match in the list of rules, later rules shouldn't have to be looked at.
I'll submit a formal feedback form.
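
To make the first-match reading concrete, here is a minimal sketch of the two
rules quoted further down in this message (WB6 and WB7a), tried in order. The
function names and the three-property window (prev, cur, nxt) are assumptions
made for illustration, and Format/Extend skipping (WB4) is left out.

```python
# Hypothetical sketch: only WB6 and WB7a, evaluated in listed order.
LETTERS = ("ALetter", "Hebrew_Letter")
MID = ("MidLetter", "MidNumLet", "Single_Quote")

def wb6(prev, cur, nxt):
    # (ALetter | Hebrew_Letter) × (MidLetter | MidNumLet | Single_Quote)
    #                             (ALetter | Hebrew_Letter)
    return prev in LETTERS and cur in MID and nxt in LETTERS

def wb7a(prev, cur, nxt):
    # Hebrew_Letter × Single_Quote (nxt is deliberately not examined)
    return prev == "Hebrew_Letter" and cur == "Single_Quote"

RULES = [("WB6", wb6), ("WB7a", wb7a)]

def first_no_break_rule(prev, cur, nxt):
    """Return the name of the first rule that forbids a break between
    prev and cur, or None if neither rule matches."""
    for name, rule in RULES:
        if rule(prev, cur, nxt):
            return name
    return None
```

Under this reading WB7a is consulted only when WB6 fails to match, for example
when a Hebrew_Letter and a Single_Quote are followed by something that is not a
letter.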

But there is another issue as well. I do not see how the specified
rules when applied to the sequence of code points:

U+0041 U+200D U+0020

cause the ZWJ, an Extend, to not break with the "A", an ALetter.

Rule WB4 is

"Ignore Format and Extend characters, except when they appear at the
beginning of a region of text."

It is not clearly stated, but it appears to me that the ZWJ must be
considered here to be the beginning of a region of text, as we are looking
at the boundary between it and the "A". No rule specifically mentions
ALetter followed by an Extend, so by the default rule, WB14

"Otherwise, break everywhere (including around ideographs)"

this should be a word break position. But that is absurd, as the Extend
is supposed to extend what precedes it. If I add a rule

"Don't break before Extend or Format"
× (Extend | Format)

my implementation passes all tests. I added this rule before WB4.
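
A minimal sketch of that literal reading, with and without the added rule, for
the sequence above. This is not the full UAX #29 algorithm: the property table
covers only the three code points in question (with ZWJ classed as Extend, as
described in this message), WB5 stands in for the other letter rules, and the
names prop, is_break, and boundaries are made up for illustration.

```python
# Hypothetical, heavily simplified sketch of the pairwise decision.
WORD_BREAK = {
    0x0041: "ALetter",  # LATIN CAPITAL LETTER A
    0x200D: "Extend",   # ZERO WIDTH JOINER, taken as Extend per this message
    0x0020: "Other",    # SPACE
}

def prop(cp):
    # Anything not in the tiny table is treated as Other.
    return WORD_BREAK.get(cp, "Other")

def is_break(before, after, extra_rule):
    """Decide whether a word boundary falls between two adjacent code
    points. Rules are tried in order and the first match wins."""
    b, a = prop(before), prop(after)
    # Proposed rule, placed ahead of everything else: × (Extend | Format)
    if extra_rule and a in ("Extend", "Format"):
        return False
    # WB5, as one representative letter rule: ALetter × ALetter
    if b == "ALetter" and a == "ALetter":
        return False
    # WB14: otherwise, break everywhere
    return True

def boundaries(cps, extra_rule):
    # Offsets (in code points) at which a break is reported.
    return [i for i in range(1, len(cps))
            if is_break(cps[i - 1], cps[i], extra_rule)]

SEQ = [0x0041, 0x200D, 0x0020]
```

Without the added rule the sketch reports a break at offset 1, between the "A"
and the ZWJ; with it, the only interior break is at offset 2, before the space.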

>
> combine the two rules and they are equivalent to these two alternate rules:
> WB56 can be read as these two:
>
> (WB56a) ALetter × (MidLetter | MidNumLet | Single_Quote) (ALetter |
> Hebrew_Letter)
>
> (WB56b) Hebrew_Letter × (MidLetter | MidNumLet | Single_Quote)
> (ALetter | Hebrew_Letter)
>
>
> Then add :
>
> (WB57) Hebrew_Letter × Single_Quote
>
> it just removes the condition of a letter following the quote in WB56b.
> So that WB56b and WB57 can be read as equivalent to these two:
>
> (WB56c) Hebrew_Letter × (MidLetter | MidNumLet) (ALetter |
> Hebrew_Letter)
>
> (WB57) Hebrew_Letter × Single_Quote
>
> But you cannot merge these last two rules into a single rule for WB56.
>
>
> 2015-01-25 7:26 GMT+01:00 Karl Williamson <public_at_khwilliamson.com>:
>
> I vaguely recall asking something like this before, but if so, I
> didn't save the answers, and a search of the archives didn't turn up
> anything.
>
> Some of the rules in UAX #29 don't make sense to me.
>
> For example, rule WB7a
> Hebrew_Letter × Single_Quote
>
> seems to say that a Hebrew_Letter followed by a Single Quote
> shouldn't break. (And Rule WB4 says that actually there can be
> Extend and Format characters between the two and those should be
> ignored).
>
> But the earlier rule, WB6
>
> (ALetter | Hebrew_Letter) × (MidLetter | MidNumLet |
> Single_Quote) (ALetter | Hebrew_Letter)
>
> seems to me to say (among other things) that a Hebrew Letter
> followed by a Single Quote shouldn't break if and only if the latter
> is also followed by either an ALetter or another Hebrew Letter
> (again modulo ignored Format and Extend characters).
>
> This seems contradictory. One rule says something unconditionally,
> and the other rule adds conditions.

_______________________________________________
Unicode mailing list
Unicode_at_unicode.org
http://unicode.org/mailman/listinfo/unicode
Received on Thu Jan 29 2015 - 12:53:48 CST
