Re: Question about formatting numerals

From: Jukka K. Korpela (jkorpela@cs.tut.fi)
Date: Mon Sep 25 2006 - 01:37:21 CST

    On Mon, 25 Sep 2006, Philippe Verdy wrote:

    > I can reply with the conventions that are used in France for almost all
    > newspapers and magazines and most book and guide publishers. The "space"
    > to use is called a "fine" and it is part of the preprint composition
    > process.

    Thank you for your detailed explanation. Similar issues arise in other
    languages as well, though most languages do not use thin spaces before or
    after some punctuation marks as French does. A common example is the
    spacing in formatting numerals, which is where this discussion started.

    > In preprint composition processes, this space is normally thinner than a
    > normal space, and unbreakable;

    As far as I have understood, its width is typically constant in the sense
    that it is not changed in justification, though it may be changed
    (document-wide) with a program command.

    > It is mandatory in French typography before colons, semicolons,
    > question marks, exclamation marks, inside guillemets, and is also used
    > as the standard thousands separators, or separator between a numeric
    > quantity and a unit (including currencies) or as the separators in
    > telephone numbers, or specially formatted numbers (like identity
    > numbers).

    Except for the punctuation marks, such principles are often applied in
    other languages as well, at least in good typography. On the other hand,
    many people have no idea of the possibilities of preventing a line break
    e.g. between a number and a unit (as in "42 m"), so they don't use any
    space ("42m"). For similar reasons, a period has often been used as a
    thousands separator even in locales where a space is the preferred
    separator: if you type "42 000", you might get the number split across
    lines, but "42.000" looks safer. Of course, it is not safe at all in the
    modern world, where there is a considerable risk of interpreting "." as a
    decimal separator.
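
    As a rough illustration of the space-separated form, here is a minimal
    Python sketch that groups digits with a no-break space, so that "42 000"
    cannot be split across lines. The function name group_thousands is made
    up for this example; passing U+202F or another space as the separator is
    equally possible.

```python
NBSP = "\u00a0"  # NO-BREAK SPACE


def group_thousands(n: int, sep: str = NBSP) -> str:
    """Format an integer with `sep` between groups of three digits.

    Hypothetical helper for illustration only.
    """
    # Python's "," format option groups by thousands; swap in our separator.
    return f"{n:,}".replace(",", sep)


if __name__ == "__main__":
    # repr() makes the otherwise invisible separator visible.
    print(repr(group_thousands(42000)))
```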

    Between a number and a unit, I would use a normal space, and that's what
    many guides recommend. But there are surely uses for a thin space.
    Actually, the variation in opinions and practices in using a normal-width
    space vs. a thin space suggests that it should be possible to distinguish
    between them in plain text. It's not a matter of general styling, as the
    specific width of the thin space is, but a matter of using a thin space
    here and a normal space there.

    > If the "unbreakable thin space" was encoded exactly in Unicode, it
    > would not be encoded as a "graphic" character, not even as a character
    > with "space" general category, but really as a *formatting control*.

    That's somewhat debatable, but let's not debate over it here. I think we
    share the concern that we would need a Unicode character that can be used
    even in plain text to indicate spacing that is thinner than a normal space
    _and_ unbreakable.

    > So the nearest match if one wants to represent it with unicode, without
    > using any markup, remains the NBSP character of Unicode.

    The no-break space is the practical choice at present in such a situation
    (and assuming that you cannot use program-dependent styling either).
    Non-breakability is more essential than thinness here and, besides,
    the no-break space is far more widely supported than the thin space.

    > It's part of
    > the preprint processor to transform the NBSP contextually (encoded near
    > digits or punctuation marks) into "fines".

    That's somewhat risky, and somewhat complex to a poor lonely text
    formatter. In an advanced typesetting program, a complicated analysis can
    be carried out, but a simple text formatting routine in a program of some
    other kind might need to be easy and simple - and language-independent.

    For example, you might have text that contains consecutive numbers,
    separated by spaces, to be treated as distinct (corresponding to, say,
    "In 2006 1500 new patents were applied for"). Such usage is usually
    stylistically bad and frowned upon in style guides, but that's a different
    issue, a different protocol level. It might be an author's mistake to
    write so, but the rendering should not make things worse by using a thin
    space between the numbers.
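
    To make the risk concrete, here is a sketch of the kind of simple,
    language-independent rule a text formatter might apply: turn every
    no-break space between two digits into a thin space. The function name
    naive_fines is invented for this example, and I assume that the two
    numbers in the sample sentence happen to be separated by a no-break
    space. The rule cannot tell a thousands separator from a gap between
    two distinct numbers.

```python
import re

NBSP = "\u00a0"  # NO-BREAK SPACE
THIN = "\u2009"  # THIN SPACE

# A no-break space with a digit immediately before and after it.
_BETWEEN_DIGITS = re.compile(r"(?<=\d)" + NBSP + r"(?=\d)")


def naive_fines(text: str) -> str:
    """Replace NBSP between digits with THIN SPACE (hypothetical helper)."""
    return _BETWEEN_DIGITS.sub(THIN, text)


if __name__ == "__main__":
    # Intended case: a thousands separator.
    print(repr(naive_fines("42" + NBSP + "000")))
    # Unintended case: two distinct numbers get pulled together as well.
    print(repr(naive_fines("In 2006" + NBSP + "1500 new patents")))
```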

    > A more exact representation in plain text, which could be used in the
    > preprint processor would be to use U+2060 (the newer recommended
    > zero-width non-breaking control) before and after one of the thin spaces
    > listed below (for example U+2009 THIN SPACE).

    Yes, but that would be rather awkward, at least unless your word processor
    has a simple command for inserting U+2060 U+2009 U+2060. Besides, who
    knows how different programs would (mis)handle that on input?
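
    For what it is worth, generating the sequence in a program is easy
    enough; the awkwardness is in typing it and in what receiving programs do
    with it. A minimal sketch, with the helper name glue_thin_spaces being my
    own invention:

```python
WJ = "\u2060"    # WORD JOINER
THIN = "\u2009"  # THIN SPACE
GLUED_THIN = WJ + THIN + WJ  # U+2060 U+2009 U+2060


def glue_thin_spaces(text: str) -> str:
    """Wrap every THIN SPACE in WORD JOINERs (hypothetical helper).

    Fully wrapped occurrences are first unwrapped, so applying the
    function twice gives the same result as applying it once.
    Occurrences with a joiner on one side only are not normalized.
    """
    return text.replace(GLUED_THIN, THIN).replace(THIN, GLUED_THIN)


if __name__ == "__main__":
    print(repr(glue_thin_spaces("42" + THIN + "000")))
```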

    > Note that U+2009 has the
    > wrong properties by itself as it is breakable, unless you use a line
    > breaker compatible with the recommended line breaking technical
    > specification. Or the alternative would be to generate U+202F.

    I'm not sure of what U+202F is really meant for.

    Anyway, the obvious solution would be to change the line breaking
    properties of the thin space U+2009 in Unicode. We cannot change
    characters, but we can change their properties, after due consideration.
    Is there any particular reason why the thin space is breakable, even
    though all known uses (well, all that I know of) are in contexts where a
    line break is undesirable and often highly undesirable? Is there software
    that relies on the breakability? If yes, could it reasonably be fixed by
    adding program-specific rules or modifying or preprocessing data by
    adding ZWSP after thin space when needed?
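
    The preprocessing I have in mind could be as simple as the following
    sketch, which assumes that existing data relies on a line break being
    allowed after every thin space. The name keep_breakable is hypothetical.

```python
import re

THIN = "\u2009"  # THIN SPACE
ZWSP = "\u200b"  # ZERO WIDTH SPACE (marks a line break opportunity)

# A thin space that is not already followed by ZWSP.
_THIN_WITHOUT_ZWSP = re.compile(THIN + "(?!" + ZWSP + ")")


def keep_breakable(text: str) -> str:
    """Add ZWSP after each THIN SPACE that lacks one (hypothetical helper)."""
    return _THIN_WITHOUT_ZWSP.sub(THIN + ZWSP, text)


if __name__ == "__main__":
    print(repr(keep_breakable("a" + THIN + "b")))
```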

    > if a higher-level protocol is available, it's still best to
    > use markup to specify the position of these "fines", and not encoding any
    > space.

    That's debatable. First, using _markup_ for such purposes would
    mean that you use various markup elements (for numbers, values of
    quantities, phone numbers, questions, quotations, etc.) quite a lot, using
    markup that mostly hasn't been implemented yet. (E.g., some browsers have
    implemented <q> for inline quotations in HTML, but for various reasons,
    it's used very little, and the implementations don't really do any
    language-sensitive formatting.) Second, I would still use a space, for
    clarity: <phone>+358 40 844 8617</phone> looks much better than
    <phone>+358408448617</phone> and works better when you enter or check data.

    > Note that English typography and French typography have different
    > recommended widths for rendering this control: French typography
    > recommends a wider advance than English typography (the French "fine" is
    > roughly about 1/4 em, and the English one is roughly 1/6 em,

    I'd rather avoid this issue since it makes the "make thin space
    unbreakable" idea more difficult to propagate. But apparently it cannot be
    avoided.

    The Unicode Standard seems to try to make a compromise of a kind: it
    characterizes the width of thin space as "1/5 em or sometimes 1/6 em".
    But it also has the four-per-em space U+2005.

    An obvious solution would be to make both U+2009 and U+2005 nonbreaking
    and leave it to text producers and editors to choose which of these
    fixed-width spaces they wish to use.
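
    For reference, the characters mentioned in this thread can be listed
    with Python's standard unicodedata module, as sketched below. Note that
    this module does not expose the line breaking class, which is the
    property under discussion; it only shows names, general categories and
    compatibility decompositions (where the <noBreak> tag distinguishes the
    no-break spaces). The name CANDIDATES and the describe function are my
    own.

```python
import unicodedata

# Code points discussed in this thread.
CANDIDATES = ["\u00a0", "\u2005", "\u2009", "\u202f", "\u200b", "\u2060"]


def describe(ch: str) -> str:
    """Return 'U+XXXX NAME (category) decomposition' for one character."""
    return "U+{:04X} {} ({}) {}".format(
        ord(ch),
        unicodedata.name(ch),
        unicodedata.category(ch),
        unicodedata.decomposition(ch) or "-",
    )


if __name__ == "__main__":
    for ch in CANDIDATES:
        print(describe(ch))
```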

    -- 
    Jukka "Yucca" Korpela, http://www.cs.tut.fi/~jkorpela/
    

