RE: Problem in Line breaking

From: Philippe Verdy (verdy_p@wanadoo.fr)
Date: Sun Feb 24 2008 - 15:22:20 CST

    There's no problem:
    1) The CL and AL classes are tailorable (no star "*" in Table 1).
    2) CL prohibits line breaks BEFORE it, but allows them AFTER it under some conditions.
    3) Paragraph 3.1 describes three kinds of tailoring profiles. The default line-breaking rules do not specify any one of the three profiles, so per-language tailoring remains possible.
    4) Using only the default rules (without any tailoring), there's a line-break opportunity after CL.

    > -----Original Message-----
    > From: unicode-bounce@unicode.org
    > [mailto:unicode-bounce@unicode.org] On Behalf Of Satoshi Nakagawa
    > Sent: Saturday, February 23, 2008 20:48
    > To: unicode@unicode.org
    > Subject: Problem in Line breaking
    >
    > Hi,
    >
    > I found a problem in the Unicode line breaking algorithm.
    >
    > In Japanese writing, [こたえは、answer] should be breakable into
    > lines like:
    >
    > こたえは、
    > answer
    >
    > Because [、](U+3001) and [。](U+3002) in Japanese are used just
    > like comma and period in English. We can break a line after
    > comma or period in English.
    >
    > But the current Unicode line breaking algorithm doesn't allow
    > this behavior for (U+3001) and (U+3002).
    >
    > I think it's a problem of the Unicode line breaking algorithm.
    > See http://www.unicode.org/reports/tr14/ .
    >
    > > CL: Closing Punctuation (XB)
    > >
    > > 3001..3002 IDEOGRAPHIC COMMA..IDEOGRAPHIC FULL STOP
    >
    > (U+3001) and (U+3002) are specified as CL.
    >
    > > LB30
    > > Do not break between letters, numbers, or ordinary symbols and opening
    > > or closing punctuation.
    > >
    > > CL × (AL | NU)

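    In fact, the rules translate as CL % AL in the pair table, i.e. it's an indirect break. There is no break directly between the two classes:
     CL × AL
    but there is a break opportunity when spaces intervene:
     CL SP+ ÷ AL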
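
    The indirect break described here can be sketched as a pair-table lookup. This is only an illustration under stated assumptions: the table below is a hand-written excerpt covering a few class pairs, and the names PAIR_TABLE and break_allowed are hypothetical, not taken from any library or from the standard's data files.

```python
# Possible actions for a pair of line-breaking classes.
DIRECT = "direct"          # break allowed, with or without spaces
INDIRECT = "indirect"      # break allowed only if spaces intervene
PROHIBITED = "prohibited"  # no break, even across spaces

# Hypothetical excerpt of a pair table: (class before, class after) -> action.
# Only the pairs needed for this illustration are listed.
PAIR_TABLE = {
    ("CL", "AL"): INDIRECT,
    ("CL", "NU"): INDIRECT,
    ("AL", "AL"): INDIRECT,
    ("AL", "CL"): PROHIBITED,
}

def break_allowed(before, spaces, after):
    """Return True if a break is allowed just before `after`.

    `spaces` is the number of SP characters between the two classes.
    """
    action = PAIR_TABLE[(before, after)]
    if action == DIRECT:
        return True
    if action == INDIRECT:
        # Indirect break: at least one space must intervene.
        return spaces > 0
    return False  # PROHIBITED

print(break_allowed("CL", 0, "AL"))  # no space between CL and AL
print(break_allowed("CL", 1, "AL"))  # one space between CL and AL
```

    With this excerpt, the first call reports no break opportunity and the second reports one, which is the distinction between the two lines above.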

    In addition, paragraph 8.2, example 5, indicates that NS will be used for kana, where NS % CL, and gives the conditions for changing NS to ID so that ID ^ CL, while NS NS is then treated like ID ÷ ID.

    Rule LB30 (only in Unicode 5.0.0, not before) was intended for words like "person(s)", to avoid breaking between the singular form and the unbreakable suffix (or prefix, or optional infix) written in parentheses. It assumes that the script within the parentheses is the same as the script used outside them. Maybe there should be a way to "transport" the script property of the inner character to the outer parenthesis, to decide whether a break is allowed there or not.

    Also, the ideographic comma (which comes after kana or ideographs) acts as a regular comma plus space when it precedes Latin letters. Some tailoring is still needed for Japanese because there are cases of words using mixed scripts, such as a Latin trademark with a Japanese kana suffix added to it, or abbreviations using Latin letters.

    Without LB30, we would break in the middle of the single French past participle "fait(e)s", where the feminine is written as optional (when the gender of the qualified object is unknown). The first rule of LB30 prohibits breaking between "fait" and "(", and the second rule of LB30 prohibits breaking before the final "s". But the real question is: do we allow a break between "t" and "e", or between "e" and "s", independently of the presence of parentheses around "e"? As we normally don't break there in the absence of parentheses, the simple inclusion of parentheses (enclosing punctuation) without surrounding spaces does not change this unbreakability.
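
    The effect of LB30 on "fait(e)s" can be checked with a small sketch. It is a deliberate simplification, not the full algorithm: only the classes OP, CL and AL are modelled, only the rules LB13, LB14, LB28, LB30 and LB31 are applied, and the helper names classify and break_positions are hypothetical.

```python
def classify(ch):
    """Very rough classification: parentheses versus everything else."""
    if ch == "(":
        return "OP"
    if ch == ")":
        return "CL"
    return "AL"  # assumption: every other character is an ordinary letter

def break_positions(text, use_lb30=True):
    """Return the indices i where a break is allowed before text[i]."""
    breaks = []
    for i in range(1, len(text)):
        before, after = classify(text[i - 1]), classify(text[i])
        if after == "CL":                      # LB13: no break before CL
            continue
        if before == "OP":                     # LB14: no break after OP
            continue
        if before == "AL" and after == "AL":   # LB28: no break between letters
            continue
        if use_lb30 and (
            (before == "AL" and after == "OP")     # LB30, first rule
            or (before == "CL" and after == "AL")  # LB30, second rule
        ):
            continue
        breaks.append(i)                       # LB31: break everywhere else
    return breaks

print(break_positions("fait(e)s"))                  # [] with LB30
print(break_positions("fait(e)s", use_lb30=False))  # [4, 7] without LB30
```

    In this sketch the two positions that appear without LB30 are the one before "(" and the one before the final "s", matching the two halves of the rule.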

    Now consider your case: we have a sequence of kana followed by Latin letters. They don't break by default, and the simple inclusion of a comma (closing/ending punctuation) between them should not make them break without explicit tailoring. This case also occurs in numbers, with group separators and decimal separators. It requires more advanced tailoring than just a pair table.
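
    To illustrate why such tailoring goes beyond a pair table, here is a minimal sketch of a context-dependent rule that looks at three characters at once. Everything in it is an assumption made for illustration: the function names are hypothetical, the script test relies on character-name prefixes from Python's unicodedata module rather than on real script data, and the rule itself is just one possible Japanese tailoring, not something defined by the standard.

```python
import unicodedata

IDEOGRAPHIC_PUNCT = {"\u3001", "\u3002"}  # IDEOGRAPHIC COMMA, IDEOGRAPHIC FULL STOP

def is_kana_or_ideograph(ch):
    """Rough script test based on the character name (an approximation)."""
    name = unicodedata.name(ch, "")
    return name.startswith(("HIRAGANA", "KATAKANA", "CJK UNIFIED IDEOGRAPH"))

def is_latin_letter(ch):
    """True for alphabetic characters whose name starts with LATIN."""
    return ch.isalpha() and unicodedata.name(ch, "").startswith("LATIN")

def tailored_break_after(text, i):
    """Hypothetical tailoring: allow a break after text[i] when it is an
    ideographic comma or full stop that follows a kana or ideograph and
    precedes a Latin letter."""
    if text[i] not in IDEOGRAPHIC_PUNCT:
        return False
    if i == 0 or i + 1 >= len(text):
        return False
    return is_kana_or_ideograph(text[i - 1]) and is_latin_letter(text[i + 1])

text = "こたえは、answer"
print(tailored_break_after(text, 4))  # position of the ideographic comma
```

    The decision depends on the character before the comma as well as the one after it, which a table indexed by a single pair of classes cannot express.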



    This archive was generated by hypermail 2.1.5 : Sun Feb 24 2008 - 19:14:57 CST