Re: UAX#14-20: undesriable line breaking opportunities (parenthese and quotation marks)

From: Asmus Freytag (asmusf@ix.netcom.com)
Date: Fri Jul 27 2007 - 00:23:41 CDT

Next message: Asmus Freytag: "Re: UAX#14-20: undesriable line breaking opportunities (parenthese and quotation marks)"

Previous message: Asmus Freytag: "Re: UAX#14-20: undesriable line breaking opportunities (parenthese and quotation marks)"
In reply to: Philippe Verdy: "RE: UAX#14-20: undesriable line breaking opportunities (parenthese and quotation marks)"
Next in thread: Rick McGowan: "Re: RE: UAX#14-20: undesriable line breaking opportunities (parenthese and quotation marks)"

On 7/26/2007 1:34 AM, Philippe Verdy wrote:
>> Also, the class CM inherits from the *preceding* character. Your model
>> would result in inheritance in the other direction, which would
>> invalidate all existing implementations (not even those that import the
>> UCD tables could update to such a scheme w/o changes in architecture).
>>
>
> I have NOT spoken of the CM class. I don't know why you are speaking about
> it.
>
Because I am contrasting the only case in the current algorithm where
there is "inheritance" with your proposal:

  ...parentheses or quotation marks could also
  be described by making them inherit the line
  breaking opportunity property from the character
  they immediately surround...

Your proposal would introduce a new type of "look-back" which is something that wasn't required before and would break many implementations.
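
To make the direction of the existing inheritance concrete, here is a minimal sketch of how a combining mark can be folded into the character before it in a single left-to-right pass. The function name and the (class, length) output are illustrative assumptions, not taken from UAX#14. The set of classes that cannot act as a base and the AL fallback reflect my reading of rules LB9 and LB10, so check them against the version of the specification you implement.

```python
# Classes that cannot serve as the base of a combining sequence
# (assumption: this mirrors the exclusions listed in LB9).
NO_BASE = {"BK", "CR", "LF", "NL", "SP", "ZW"}

def absorb_combining_marks(classes):
    """Fold each CM into the preceding unit; return (class, length) tuples.

    Only the unit already emitted is consulted, so the pass never needs
    to look at characters that follow the combining mark.
    """
    units = []  # each entry is [line break class, number of characters]
    for cls in classes:
        if cls == "CM" and units and units[-1][0] not in NO_BASE:
            units[-1][1] += 1          # CM inherits from the preceding character
        elif cls == "CM":
            units.append(["AL", 1])    # CM with no usable base is treated as AL
        else:
            units.append([cls, 1])
    return [tuple(u) for u in units]

print(absorb_combining_marks(["ID", "CM", "OP", "CM", "CM", "SP", "CM"]))
```

A class that inherited from the character *after* it could not be resolved this way, because the unit it belongs to has not been seen yet when the character is read.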

> The only relevant rule is LB30, but anyway if it effectively solved the
> problem for the case of "word(s)" in the Latin script, the effect of LB30
> will be too broad in ideographic texts.
>
No, it won't. Apparently you are having difficulty applying the chain
of rules correctly from the top; otherwise you would not have made this
assertion.

Because it is really difficult to apply 30 rules in sequence in your
head, I've always insisted that the linebreak algorithm be representable
as a pair table.

The pair table in section 7.3 of the specification captures the same
rules, and the results can be read off it directly. The case you are
interested in

ID OP

is found by reading the intersection between the ID row and OP column in
the table (rows are the "before" context, columns are the "after"
context). At the given intersection you find a "_" meaning that breaks
are allowed without space.

Conversely for

OP ID

you will find that there's a "^" at the intersection of the OP row and
ID column, meaning that breaks are not allowed.
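
As a rough illustration of that lookup, the sketch below stores only the two cells discussed in this message; every other cell of the real table in section 7.3 is deliberately left out, and the table also uses symbols beyond the two shown here. The names PAIR_TABLE, break_action and can_break are hypothetical.

```python
DIRECT = "_"       # break allowed without an intervening space
PROHIBITED = "^"   # break not allowed

# Keys are (before class, after class): row first, then column.
# Only the two cells quoted in this message are filled in.
PAIR_TABLE = {
    ("ID", "OP"): DIRECT,
    ("OP", "ID"): PROHIBITED,
}

def break_action(before, after):
    """Return the table symbol at the intersection of row and column."""
    return PAIR_TABLE[(before, after)]  # KeyError for cells not copied here

def can_break(before, after):
    """True if a break is allowed directly between the two classes."""
    return break_action(before, after) == DIRECT

print(can_break("ID", "OP"))  # True
print(can_break("OP", "ID"))  # False
```

The point of the representation is that each decision is a single lookup on two adjacent classes, with no need to walk the rule list by hand.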

So, all the lengthy discussion that follows can be disregarded, since it
starts from a mistaken premise. From here...
> Suppose that I1, I2, I3 are sequences of ideographs. If they occur in
> sequence in such a way that line breaking is allowed between them, then we
> have:
>
> I1 ÷ I2 ÷ I3
>
> This is the normal way to handle line breaks in ideographic texts (or other
> scripts that typically don't use explicit whitespaces, so extend this
> discussion to these scripts too.)
>
> Now suppose that punctuation pairs (OP and CL) are used in sequence:
> I1 OP I2 CL I3
> The line-breaking should now be prohibited between OP and I2 and between I2
> and CL, however it should not be prohibited between I1 and OP and between CL
> and I3. In other words:
> I1 ÷ OP × I2 × CL ÷ I3
>
> The best way to formulate what I mean is :
>
> Independently of the nature of the <a>, <b>, <c> or <d> characters
> below, we are here just considering the characters on each side of the
> opening and closing character, such that :
> * a line break between <a> and <OP,b> should be allowed (resp.
> prohibited) if and only if a line break would be allowed (resp. prohibited)
> between <a> and <b>. The sequence <OP,b> is not breakable.
> * a line break between <c,CL> and <d> should be allowed (resp.
> prohibited) if and only if a line break would be allowed (resp. prohibited)
> between <c> and <d>. The sequence <c,CL> is not breakable.
>
>
...to here.
> And this is what I mean when I say that:
> * opening punctuation should be treated as if they were extending
> the grapheme cluster of the first characters encoded after it, and
> inheriting its line-breaking properties (this has NOTHING to do with the CM
> class that I did not discuss).
>
The CM class is the one that implements grapheme clusters. So yes, your
discussion has everything to do with it, since you now have additional
classes that *inherit*, with all the problems that CM has (note the
special handling it requires, for example in section 7.5 of UAX#14),
*plus* the additional complication that your new method would inherit
downstream as well as upstream.
> * closing punctuation should be treated as if they were extending
> the grapheme cluster of the last characters encoded before it, and
> inheriting its line-breaking properties (this has NOTHING to do with the CM
> class that I did not discuss).
>
> These rules are a bit different from LB30, and I think more appropriate
> because they will work in the ideographic context. I suspect that if LB30 is
> not implemented, it's because it did not work correctly with these texts.
>
>
>
These new rules are not required, since UAX#14 already performs
satisfactorily - if implemented correctly.

A./
>
>
>


This archive was generated by hypermail 2.1.5 : Fri Jul 27 2007 - 00:26:39 CDT