Re: UAX#14-20: undesirable line breaking opportunities (parentheses and quotation marks)

From: Asmus Freytag (asmusf@ix.netcom.com)
Date: Fri Jul 27 2007 - 00:23:41 CDT

    On 7/26/2007 1:34 AM, Philippe Verdy wrote:
    >> Also, the class CM inherits from the *preceding* character. Your model
    >> would result in inheritance in the other direction, which would
    >> invalidate all existing implementations (not even those that import the
    >> UCD tables could update to such a scheme w/o changes in architecture).
    >>
    >
    > I have NOT spoken of the CM class. I don't know why you are speaking about
    > it.
    >
    Because I am contrasting the only case in the current algorithm where
    there is "inheritance" with your proposal:

      ...parentheses or quotation marks could also
      be described by making them inherit the line
      breaking opportunity property from the character
      they immediately surround...

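    Your proposal would introduce a new type of "look-ahead", because an
    opening punctuation mark would take its class from the character that
    follows it. Nothing in the current algorithm requires that, and
    existing implementations would need architectural changes to support it.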

    > The only relevant rule is LB30, but anyway if it effectively solved the
    > problem for the case of "word(s)" in the Latin script, the effect of LB30
    > will be too broad in ideographic texts.
    >
    No, it won't. Apparently you are having difficulty applying the chain
    of rules correctly from the top; otherwise you would not have made this
    assertion.

    Because it is really difficult to apply 30 rules in sequence in your
    head, I've always insisted that the linebreak algorithm be representable
    as a pair table.

    The pair table in section 7.3 of UAX#14 captures the same rules as the
    specification, but the results can be read off the table directly. The
    case you are interested in

        ID OP

    is found by reading the intersection between the ID row and OP column in
    the table (rows are the "before" context, columns are the "after"
    context). At the given intersection you find a "_" meaning that breaks
    are allowed without space.

    Conversely for

        OP ID

    you will find that there's a "^" at the intersection of the OP row and
    ID column, meaning that breaks are not allowed.
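
    As a minimal sketch of that lookup (Python; the dictionary is a
    hypothetical two-cell excerpt holding only the cells discussed above,
    not the full table from UAX#14, and the function names are invented
    for illustration):

```python
# Hypothetical excerpt of a pair table: only the two cells discussed above.
# Keys are (before_class, after_class), i.e. (row, column).
# "^" = break prohibited, "_" = direct break (allowed without a space).
PAIR_TABLE_EXCERPT = {
    ("ID", "OP"): "_",
    ("OP", "ID"): "^",
}


def lookup(before, after):
    """Return the table symbol found at row `before`, column `after`."""
    return PAIR_TABLE_EXCERPT[(before, after)]


def break_allowed(before, after):
    """True if the excerpt marks a direct break between the two classes."""
    return lookup(before, after) == "_"
```

    With this excerpt, break_allowed("ID", "OP") is true and
    break_allowed("OP", "ID") is false, matching the two readings above.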

    So, all the lengthy discussion that follows can be disregarded, since it
    starts from a mistaken premise. From here...
    > Suppose that I1, I2, I3 are sequences of ideographs. If they occur in
    > sequence in such a way that line breaking is allowed between them, then we
    > have:
    >
    > I1 ÷ I2 ÷ I3
    >
    > This is the normal way to handle line breaks in ideographic texts (or other
    > scripts that typically don't use explicit whitespaces, so extend this
    > discussion to these scripts too.)
    >
    > Now suppose that punctuation pairs (OP and CL) are used in sequence :
    > I1 OP I2 CL I3
    > The line-breaking should now be prohibited between OP and I2 and between I2
    > and CL, however it should not be prohibited between I1 and OP and between CL
    > and I3. In other words:
    > I1 ÷ OP × I2 × CL ÷ I3
    >
    > The best way to formulate what I mean is :
    >
    > Independently of the nature of the <a>, <b>, <c> or <d> characters
    > below, we are here just considering the characters on each side of the
    > opening and closing character, such that :
    > * a line break between <a> and <OP,b> should be allowed (resp.
    > prohibited) if and only if a line break would be allowed (resp. prohibited)
    > between <a> and <b>. The sequence <OP,b> is not breakable.
    > * a line break between <c,CL> and <d> should be allowed (resp.
    > prohibited) if and only if a line break would be allowed (resp. prohibited)
    > between <c> and <d>. The sequence <c,CL> is not breakable.
    >
    >
    ...to here.
    > And this is what I mean when I say that:
    > * opening punctuation should be treated as if they were extending
    > the grapheme cluster of the first characters encoded after it, and
    > inheriting its line-breaking properties (this has NOTHING to do with the CM
    > class that I did not discuss).
    >
    The CM class is the one that implements grapheme clusters. So yes, your
    discussion has everything to do with it, since you now have additional
    classes that *inherit*, with all the problems that CM has (note the
    special handling it requires, for example in section 7.5 in UAX#14),
    *plus* the additional complication that your new method would inherit
    downstream as well as upstream.
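
    To illustrate the direction of the existing inheritance, here is a
    rough sketch in Python. It assumes the behaviour of rules LB9 and LB10
    as I read them (a CM takes the class of the preceding character, except
    after BK, CR, LF, NL, SP or ZW, or at the start of text, where it is
    treated as AL). The function name and the list-of-classes input are
    invented for the example; it only resolves classes and does not, by
    itself, keep the combining sequence unbroken.

```python
# Classes after which a CM does not inherit (assumption based on LB9).
NO_INHERIT = {"BK", "CR", "LF", "NL", "SP", "ZW"}


def resolve_cm(classes):
    """Replace each CM with the class it inherits from what precedes it.

    Only already-resolved input is consulted, so a single forward pass
    with no look-ahead is enough.
    """
    resolved = []
    for cls in classes:
        if cls != "CM":
            resolved.append(cls)
        elif resolved and resolved[-1] not in NO_INHERIT:
            # Inherit from the preceding (already resolved) character.
            resolved.append(resolved[-1])
        else:
            # No usable base character: treat the CM as AL (LB10).
            resolved.append("AL")
    return resolved
```

    For example, resolve_cm(["ID", "CM", "OP", "CM"]) gives
    ["ID", "ID", "OP", "OP"]. A class that inherited from the *following*
    character could not be resolved in such a pass without reading ahead.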
    > * closing punctuation should be treated as if they were extending
    > the grapheme cluster of the last characters encoded before it, and
    > inheriting its line-breaking properties (this has NOTHING to do with the CM
    > class that I did not discuss).
    >
    > These rules are a bit different from LB30, and I think more appropriate
    > because they will work in the ideographic context. I suspect that if LB30 is
    > not implemented, it's because it did not work correctly with these texts.
    >
    >
    >
    These new rules are not required, since UAX#14 already performs
    satisfactorily - if implemented correctly.
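
    As a sketch of that claim for the sequence I1 OP I2 CL I3 (Python;
    the table is a hypothetical excerpt: the ID/OP and OP/ID cells are
    the ones read off above, while the ID/ID, ID/CL and CL/ID cells are my
    reading of the rules and should be checked against the table in
    UAX#14; spaces and all other classes are ignored):

```python
# Hypothetical excerpt of the pair table, keyed by (before, after).
# "^" = break prohibited, "_" = direct break.
PAIRS = {
    ("ID", "ID"): "_",
    ("ID", "OP"): "_",
    ("OP", "ID"): "^",
    ("ID", "CL"): "^",  # assumption: no break before closing punctuation
    ("CL", "ID"): "_",  # assumption: default break after CL before ID
}


def mark_breaks(classes):
    """Join classes with '÷' (break allowed) or '×' (break prohibited)."""
    out = [classes[0]]
    for before, after in zip(classes, classes[1:]):
        out.append("÷" if PAIRS[(before, after)] == "_" else "×")
        out.append(after)
    return " ".join(out)


print(mark_breaks(["ID", "OP", "ID", "CL", "ID"]))
# prints: ID ÷ OP × ID × CL ÷ ID
```

    Under these assumptions the plain pair lookup already produces the
    break pattern requested for ideographic text, with no inheritance.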

    A./