RE: UAX#14-20: undesirable line breaking opportunities (parentheses and quotation marks)

From: Philippe Verdy (verdy_p@wanadoo.fr)
Date: Thu Jul 26 2007 - 04:28:46 CDT

    > -----Original Message-----
    > From: Philippe Verdy [mailto:verdy_p@wanadoo.fr]
    > Sent: Thursday, 26 July 2007 09:39
    > To: 'Kenneth Whistler'
    > Cc: 'unicode@unicode.org'
    > Subject: RE: UAX#14-20: undesirable line breaking opportunities (parentheses
    > and quotation marks)
    >
    > > And in particular, the relevant rules are:
    > > (...)
    > > LB30 Do not break between letters, numbers, or ordinary symbols and
    > > opening or closing punctuation.
    > >
    > > (AL | NU) × OP
    > > CL × (AL | NU)
    > >
    > > Those rules seem *already* to be doing exactly what you seem to
    > > be asking for.

    If you really think that these rules are sufficient, I still maintain that
    LB30 is ambiguous: it in fact consists of TWO separate rules that are
    incorrectly summarized by its description (the term "between" combined
    with the "or" in "opening or closing" is the main source of confusion).

    So I suggest rewriting it as:

            LB30.1 Do not break after letters, numbers, or ordinary symbols
            and before opening punctuation.

            (AL | NU) × OP

            LB30.2 Do not break after closing punctuation and
            before letters, numbers, or ordinary symbols.

            CL × (AL | NU)
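
    As a rough illustration only (a sketch, not text from UAX #14), the two
    rules can be checked independently of each other. The function names are
    hypothetical, and line-breaking classes are passed as plain strings:

```python
# Hypothetical sketch: LB30 split into two independent pair rules.
# Classes are plain strings ("AL", "NU", "OP", "CL"); this is not a
# full implementation of the line breaking algorithm.

def lb30_1_forbids_break(before: str, after: str) -> bool:
    """LB30.1: (AL | NU) x OP, no break before opening punctuation."""
    return before in ("AL", "NU") and after == "OP"


def lb30_2_forbids_break(before: str, after: str) -> bool:
    """LB30.2: CL x (AL | NU), no break after closing punctuation."""
    return before == "CL" and after in ("AL", "NU")


def lb30_forbids_break(before: str, after: str) -> bool:
    """The original LB30 is simply the union of the two rules."""
    return (lb30_1_forbids_break(before, after)
            or lb30_2_forbids_break(before, after))
```

    Note that neither rule says anything about the pairs OP followed by AL or
    AL followed by CL, which is exactly what the word "between" in the
    original description may wrongly suggest.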

    And I would add a third item about punctuation marks that may be used
    as either opening or closing punctuation, either because this is
    language/locale dependent (notably quotation marks), or because they are
    intrinsically ambiguous (such as the ASCII vertical single or double quotes).

    In such a case, if it cannot be determined (from the character itself or
    from the language actually in use) whether a punctuation mark is opening
    or closing, then the two separate rules should BOTH apply, by making
    these punctuation marks members of BOTH line-breaking classes OP and CL.
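
    A minimal sketch of that idea, assuming a toy class assignment that is
    NOT the real UAX #14 property data: each character maps to a set of
    classes, and the ambiguous quotes carry both OP and CL. The names
    classes_of and no_break_between are hypothetical.

```python
# Hypothetical sketch: ambiguous quotes belong to BOTH classes OP and CL.
# The class assignment below is a toy approximation for illustration.

AMBIGUOUS_QUOTES = {'"', "'"}


def classes_of(ch: str) -> set:
    """Return the set of (toy) line-breaking classes for one character."""
    if ch in AMBIGUOUS_QUOTES:
        return {"OP", "CL"}
    if ch in "([{":
        return {"OP"}
    if ch in ")]}":
        return {"CL"}
    if ch.isdigit():
        return {"NU"}
    if ch.isalpha():
        return {"AL"}
    return set()


def no_break_between(a: str, b: str) -> bool:
    """Apply both halves of LB30 to the pair (a, b)."""
    ca, cb = classes_of(a), classes_of(b)
    rule_1 = bool(ca & {"AL", "NU"}) and "OP" in cb   # (AL | NU) x OP
    rule_2 = "CL" in ca and bool(cb & {"AL", "NU"})   # CL x (AL | NU)
    return rule_1 or rule_2
```

    With this assignment a break is forbidden on both sides of an ASCII
    quote that touches a letter or digit, while unambiguous parentheses are
    still protected on one side only by these two rules.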

    Now, about the implementation:
    * for closing punctuation the case is simple: treat it as if it were a
    combining character encoded after the combining sequence that it extends,
    so that the whole is handled as one larger grapheme cluster. This should
    apply in all cases except after whitespace and explicit line-break
    controls (or explicit ends of verses, if they are marked as such in some
    scripts, for example with double dandas).
    * for opening punctuation the case is a bit more difficult, because it
    requires an additional forward lookup (lookahead) to see how to handle it.
    * for ambiguously opening or closing punctuation (mostly the quotation
    marks discussed above), the best approach is to prohibit line breaks BOTH
    before AND after it, unless the adjacent character is whitespace, a
    character that explicitly forces a line break, or one that explicitly
    indicates that a line break is allowed, such as a disjoiner control.
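
    The last case could be sketched as follows. This is only an
    illustration under stated assumptions: the set BREAK_ALLOWING is a
    hypothetical stand-in for "whitespace, forced breaks and controls that
    allow a break" (ZERO WIDTH SPACE is used here as the example control),
    and the function names are made up for this message.

```python
# Hypothetical sketch: forbid breaks on both sides of an ambiguous quote,
# unless the neighbouring character is whitespace or explicitly allows
# (or forces) a break.

AMBIGUOUS_QUOTES = {'"', "'"}
# Assumption: space, tab, newline and ZERO WIDTH SPACE (U+200B) stand in
# for every character that permits or forces a break.
BREAK_ALLOWING = {" ", "\t", "\n", "\u200b"}


def quote_rule_forbids_break(text: str, i: int) -> bool:
    """True if a break between text[i-1] and text[i] is forbidden."""
    if not 0 < i < len(text):
        return False
    left, right = text[i - 1], text[i]
    if left in BREAK_ALLOWING or right in BREAK_ALLOWING:
        return False
    return left in AMBIGUOUS_QUOTES or right in AMBIGUOUS_QUOTES


def forbidden_positions(text: str) -> list:
    """All break positions that the quote rule forbids in text."""
    return [i for i in range(1, len(text))
            if quote_rule_forbids_break(text, i)]
```

    For the string say "hi" now, this forbids the break between the opening
    quote and "h" and between "i" and the closing quote, but leaves the
    positions next to the surrounding spaces untouched.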


