UAX#14-20: undesirable line breaking opportunities (parentheses and quotation marks)

From: Philippe Verdy (
Date: Wed Jul 25 2007 - 06:46:00 CDT

    I have not been able to submit this using the Unicode bug report form (due
    to a technical problem with the form), because it does not accept my valid
    email address: it incorrectly rejects the underscore (_) present in the user
    name part of my email address, although it is perfectly valid there (it is
    only forbidden in the Internet domain name part).
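
    As a rough illustration of the asymmetry described above, here is a
    minimal Python sketch that validates the two sides of the "@" with
    different character sets. It assumes the dot-atom syntax of RFC 2822 for
    the user name part and letter-digit-hyphen labels for the domain part;
    the function name and the example addresses are hypothetical, and this is
    not the code of the Unicode bug report form.

```python
import re

# Local part: dot-separated runs of RFC 2822 "atext" characters.
# The underscore is allowed here.
ATEXT = r"[A-Za-z0-9!#$%&'*+/=?^_`{|}~-]+"
LOCAL_RE = re.compile(rf"{ATEXT}(?:\.{ATEXT})*")

# Domain part: dot-separated labels made of letters, digits and hyphens,
# neither starting nor ending with a hyphen. No underscore is allowed.
# Assumption: at least two labels are required (e.g. "example.org").
LABEL = r"[A-Za-z0-9](?:[A-Za-z0-9-]*[A-Za-z0-9])?"
DOMAIN_RE = re.compile(rf"{LABEL}(?:\.{LABEL})+")


def is_plausible_address(address):
    """Check each side of the last "@" against its own character set."""
    local, sep, domain = address.rpartition("@")
    if not sep:
        return False  # no "@" at all
    return bool(LOCAL_RE.fullmatch(local)) and bool(DOMAIN_RE.fullmatch(domain))
```

    With this sketch, "first_last@example.org" is accepted, while
    "user@my_host.example.org" is rejected because of the underscore in the
    domain part.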

    Can someone in the UTC check once again the code for the Bug report form on
    the Unicode site so that it will correctly parse and validate email
    addresses by not assuming that the characters allowed on each side of the
    "@" are the same (they have never been the same subsets, per RFC

    AND PLEASE, can someone at UTC (Rick McGowan?) copy-paste this message below
    using his own email address for posting there, so that it is considered in
    the current public review of UAX#14 update 20 (closing on July 30)?
    Some comments about apparently forgotten cases.

    The line breaking rules do not seem to handle some special cases in which
    undesirable line breaks are currently allowed.
    This happens, for example, with parentheses, which currently always allow
    line breaks before or after them and the text they surround.

    I can cite an example from the officially documented French toponyms:
    "Château-Chinon(Ville)" and "Château-Chinon(Campagne)", which designate
    two distinct French communes; each forms a single compound name. The INSEE
    officially writes them WITHOUT a space separator (the term within
    parentheses is then not a common word but part of the toponym, so it takes
    a mandatory capital).

    In this case, allowing a line break before the opening parenthesis would
    permit a rendering in which the line break, if inserted, would be read as
    if there were a space, and the required capital on the term "Ville" or
    "Campagne" between parentheses would look like a typo.

    Note the difference with the French names of a few cantons that are
    *qualified* by adding " (ville)" or " (campagne)" with a space separator and
    no capital for the specifier (this occurs for example in the canton and
    arrondissement around the French city (toponym) of "Strasbourg"). The
    resulting name is NOT a compound name.

    Note the difference with toponyms (or other proper names) that would be
    otherwise written as "...-Ville" or "...-Campagne": in this case the
    linebreak is possible after the hyphen, which remains when a line break
    occurs and still explicitly marks that this is a compound name.

    For strange reasons, the INSEE reference for French administrative units
    (and the IGN, for its official toponyms) have used parentheses instead of an

    How to handle this case, in a way so that parentheses will not allow a
    linebreak on BOTH sides of parentheses if they are surrounded by

    I can give another more common example where such linebreaks are
    "un (ou plusieurs) mot(s)"
    Note how the "s" plural mark in "mots" is marked as an alternative; it is
    not separable from the word it normally completes. Inserting a linebreak
    between "mot" and "(s)" would be wrong.

    Another example occurs when writing maths formulas such as "f(x) = x + 2".
    Here again, the term "f(x)" should remain unbreakable. The same should
    apply to the term "f[x]" in "f[x] = x + 2".

    I propose disallowing line breaks on ***BOTH*** sides of:
    * (parentheses), or parenthesis-like characters such as
    * [square brackets],
    * ‹angle brackets or quotation marks› (we can accept it for less-than and
    greater-than signs), or even
    * “double 6/9 quotation marks”, or
    * «double angle quotation marks», or
    * ‘single 6/9 quotation marks’, or
    if and only if, the characters that are on each side of the marks would be
    unbreakable in absence of these marks.
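
    To make the proposed condition concrete, here is a minimal Python sketch.
    It does NOT implement UAX#14: the "current" behaviour is a toy model that
    allows a break after a space or a hyphen, before an opening mark and after
    a closing mark, and never before a space. The names (plain_break,
    toy_current_breaks, proposed_breaks) are hypothetical. The only part meant
    to reflect the proposal is the suppression step, which drops a break next
    to a mark when the nearest non-mark characters on each side would not be
    breakable without the marks.

```python
OPENING = set("([‹“«‘")
CLOSING = set(")]›”»’")
MARKS = OPENING | CLOSING


def plain_break(left, right):
    """Toy stand-in for 'breakable in absence of the marks':
    a break is allowed after a space or a hyphen, but not before a space."""
    return left in " -" and right != " "


def toy_current_breaks(text):
    """Offsets i where the toy model allows a break between text[i-1] and text[i]."""
    breaks = set()
    for i in range(1, len(text)):
        left, right = text[i - 1], text[i]
        if right == " ":
            continue  # never break before a space
        if plain_break(left, right) or right in OPENING or left in CLOSING:
            breaks.add(i)
    return breaks


def proposed_breaks(text):
    """Apply the proposed restriction on top of the toy model."""
    breaks = set()
    for i in toy_current_breaks(text):
        if text[i - 1] in MARKS or text[i] in MARKS:
            # Find the nearest non-mark character on each side of the marks.
            l = i - 1
            while l >= 0 and text[l] in MARKS:
                l -= 1
            r = i
            while r < len(text) and text[r] in MARKS:
                r += 1
            # Suppress the break if those two characters would be unbreakable.
            if l >= 0 and r < len(text) and not plain_break(text[l], text[r]):
                continue
        breaks.add(i)
    return breaks
```

    Under these assumptions, "mot(s)" and "f(x)" no longer get a break before
    the opening mark, "Château-Chinon(Ville)" can only break after the hyphen,
    and "Strasbourg (ville)" keeps its break after the space.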

    Note that I include the quotation marks because they are quite often used to
    emphasize some important parts within a word.

    This will also cover the case where ‘single 6/9 quotation marks’ are
    used as apostrophes (common in French and English to mark elision of letters
    or some abbreviated words) or reversed apostrophes (used in Polynesian
    languages as a glottal consonant mark).

    Are there known cases where a line break would still remain desirable with
    these conditions?


    This archive was generated by hypermail 2.1.5 : Wed Jul 25 2007 - 06:48:09 CDT