Re: Question about formatting numerals

From: Jukka K. Korpela (jkorpela@cs.tut.fi)
Date: Mon Sep 25 2006 - 01:37:21 CST

In reply to: Philippe Verdy: "Re: Question about formatting numerals"

On Mon, 25 Sep 2006, Philippe Verdy wrote:

> I can reply with the conventions that are used in France for almost all
> newspapers and magazines and most book and guide publishers. The "space"
> to use is called a "fine" and it is part of the preprint composition
> process.

Thank you for your detailed explanation. Similar issues arise in other
languages as well, though most languages do not use thin spaces before or
after some punctuation marks as French does. A common example is the
spacing in formatting numerals, which is where this discussion started.

> In preprint composition processes, this space is normally thinner than a
> normal space, and unbreakable;

As far as I have understood, its width is typically constant in the sense
that it is not changed in justification, though it may be changed
(document-wide) with a program command.

> It is mandatory in French typography before colons, semicolons,
> question marks, exclamation marks, inside guillemets, and is also used
> as the standard thousands separators, or separator between a numeric
> quantity and a unit (including currencies) or as the separators in
> telephone numbers, or specially formatted numbers (like identity
> numbers).

Except for the punctuation marks, such principles are often applied in
other languages as well, at least in good typography. On the other hand,
many people have no idea of the possibilities of preventing a line break
e.g. between a number and a unit (as in "42 m"), so they don't use any
space ("42m"). For similar reasons, a period has often been used as a
thousands separator even in locales where a space is the preferred
separator: if you type "42 000", you might get the number split across
lines, but "42.000" looks safer. Of course, it is not safe at all in the
modern world, where there is a considerable risk of interpreting "." as a
decimal separator.
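
The "42 000" and "42 m" cases can be sketched in a few lines of Python.
This is only an illustration, assuming the no-break space U+00A0 is the
separator; the function names group_thousands and quantity are made up
for this example.

```python
NBSP = "\u00a0"  # NO-BREAK SPACE


def group_thousands(n: int, sep: str = NBSP) -> str:
    """Format an integer with `sep` between groups of three digits."""
    # Python's "," format spec groups by thousands; swap the comma for `sep`.
    return f"{n:,}".replace(",", sep)


def quantity(value: int, unit: str, sep: str = NBSP) -> str:
    """Join a grouped number and its unit with the same unbreakable space."""
    return f"{group_thousands(value, sep)}{sep}{unit}"


if __name__ == "__main__":
    # repr() makes the otherwise invisible separators visible.
    print(repr(group_thousands(42000)))
    print(repr(quantity(42, "m")))
```

A caller that prefers a different space (a thin one, say) can pass it as
sep without touching the grouping logic.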

Between a number and a unit, I would use a normal space, and that's what
many guides recommend. But there are surely uses for a thin space.
Actually, the variation in opinions and practices in using a normal-width
space vs. a thin space suggests that it should be possible to distinguish
between them in plain text. It's not a matter of general styling, as the
specific width of the thin space is, but a matter of using a thin space
here and a normal space there.

> If the "unbreakable thin space" was encoded exactly in Unicode, it
> would not be encoded as a "graphic" character, not even as a character
> with "space" general category, but really as a *formatting control*.

That's somewhat debatable, but let's not debate over it here. I think we
share the concern that we would need a Unicode character that can be used
even in plain text to indicate spacing that is thinner than a normal space
_and_ unbreakable.

> So the nearest match if one wants to represent it with unicode, without
> using any markup, remains the NBSP character of Unicode.

The no-break space is the practical choice at present in such a situation
(and assuming that you cannot use program-dependent styling either).
Non-breakability is more essential than thinness here and, besides,
the no-break space is far more widely supported than the thin space.

> It's part of
> the preprint processor to transform the NBSP contextually (encoded near
> digits or punctuation marks) into "fines".

That's somewhat risky, and somewhat complex for a poor lonely text
formatter. In an advanced typesetting program, a complicated analysis can
be carried out, but a simple text formatting routine in a program of some
other kind might need to be easy and simple - and language-independent.

For example, you might have text that contains consecutive numbers,
separated by spaces, to be treated as distinct (corresponding to, say,
"In 2006 1500 new patents were applied for"). Such usage is usually
stylistically bad and frowned upon in style guides, but that's a different
issue, a different protocol level. It might be an author's mistake to
write so, but the rendering should not make things worse by using a thin
space between the numbers.
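
To make the risk concrete, here is a deliberately naive contextual rule,
sketched in Python. It assumes (my assumption, not anything Philippe
described) that the two numbers ended up separated by a no-break space,
for instance because an earlier tool made every space between digits
unbreakable. The name naive_fines is hypothetical.

```python
import re

NBSP = "\u00a0"  # NO-BREAK SPACE
THIN = "\u2009"  # THIN SPACE

# A no-break space with a digit on each side.
_BETWEEN_DIGITS = re.compile(r"(?<=\d)\u00a0(?=\d)")


def naive_fines(text: str) -> str:
    """Turn every NBSP that sits between two digits into a thin space."""
    return _BETWEEN_DIGITS.sub(THIN, text)


if __name__ == "__main__":
    # Intended case: one number with a thousands separator.
    print(repr(naive_fines("42" + NBSP + "000")))
    # Unintended case: two distinct numbers get the same treatment.
    print(repr(naive_fines("In 2006" + NBSP + "1500 new patents")))
```

The rule has no way to tell a thousands separator from a gap between two
separate numbers, so "2006" and "1500" are drawn together just like the
digit groups of one number.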

> A more exact representation in plain text, which could be used in the
> preprint processor would be to use U+2060 (the newer recommended
> zero-width non-breaking control) before and after one of the thin spaces
> listed below (for example U+2009 THIN SPACE).

Yes, but that would be rather awkward, at least unless your word processor
has a simple command for inserting U+2060 U+2009 U+2060. Besides, who
knows how different programs would (mis)handle that on input?
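
For what it is worth, the sequence itself is trivial to generate in a
program; the awkwardness is in typing it and in what other software does
with it. A small Python sketch, with the made-up helper names
unbreakable_thin and strip_word_joiners:

```python
WJ = "\u2060"    # WORD JOINER
THIN = "\u2009"  # THIN SPACE


def unbreakable_thin(left: str, right: str) -> str:
    """Join two strings with U+2060 U+2009 U+2060."""
    return left + WJ + THIN + WJ + right


def strip_word_joiners(text: str) -> str:
    """Remove U+2060, e.g. before passing text to software that mishandles it."""
    return text.replace(WJ, "")


if __name__ == "__main__":
    s = unbreakable_thin("42", "000")
    # Five visible characters, but eight code points in the string.
    print(len(s), repr(s))
    print(repr(strip_word_joiners(s)))
```

Note that one visible gap costs three code points, which also affects
searching and string comparison unless the receiving program normalizes
the text in some way.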

> Note that U+2009 has the
> wrong properties by itself as it is breakable, unless you use a line
> breaker compatible with the recommended line breaking technical
> specification. Or the alternative would be to generate U+202F.

I'm not sure of what U+202F is really meant for.

Anyway, the obvious solution would be to change the line breaking
properties of the thin space U+2009 in Unicode. We cannot change
characters, but we can change their properties, after due consideration.
Is there any particular reason why the thin space is breakable, even
though all known uses (well, all that I know of) are in contexts where a
line break is undesirable and often highly undesirable? Is there software
that relies on the breakability? If yes, could it reasonably be fixed by
adding program-specific rules or modifying or preprocessing data by
adding ZWSP after thin space when needed?
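
The preprocessing step could be as small as the following Python sketch.
It assumes a hypothetical future in which U+2009 no longer allows a line
break, and restores the break opportunity by adding ZERO WIDTH SPACE
(U+200B) after it; keep_breakable is an invented name.

```python
import re

THIN = "\u2009"  # THIN SPACE
ZWSP = "\u200b"  # ZERO WIDTH SPACE

# A thin space that is not already followed by a ZWSP.
_THIN_WITHOUT_ZWSP = re.compile(r"\u2009(?!\u200b)")


def keep_breakable(text: str) -> str:
    """Insert ZWSP after each thin space so that a line break stays possible."""
    return _THIN_WITHOUT_ZWSP.sub(THIN + ZWSP, text)


if __name__ == "__main__":
    print(repr(keep_breakable("a" + THIN + "b")))
```

The negative lookahead makes the function safe to apply twice to the same
data, which matters if the preprocessing may run more than once.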

> if a higher-level protocol is available, it's still best to
> use markup to specify the position of these "fines", and not encoding any
> space.

That's debatable. First, using _markup_ for such purposes would
mean that you use various markup elements (for numbers, values of
quantities, phone numbers, questions, quotations, etc.) quite a lot, using
markup that mostly hasn't been implemented yet. (E.g., some browsers have
implemented <q> for inline quotations in HTML, but for various reasons,
it's used very little, and the implementations don't really do any
language-sensitive formatting.) Second, I would still use a space, for
clarity: <phone>+358 40 844 8617</phone> looks much better than
<phone>+358408448617</phone> and works better when you enter or check data.

> Note that English typography and French typography have different
> recommended widths for rendering this control: French typography
> recommends a wider advance than English typography (the French "fine" is
> roughly about 1/4 em, and the English one is roughly 1/6 em,

I'd rather avoid this issue since it makes the "make thin space
unbreakable" idea more difficult to propagate. But apparently it cannot be
avoided.

The Unicode Standard seems to try to make a compromise of a kind: it
characterizes the width of thin space as "1/5 em or sometimes 1/6 em".
But it also has the four-per-em space U+2005.

An obvious solution would be to make both U+2009 and U+2005 nonbreaking
and leave it to text producers and editors to choose which of these
fixed-width spaces they wish to use.
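
For reference, the characters mentioned in this thread can be listed with
their Unicode names and general categories using Python's standard
unicodedata module. As far as I know that module does not expose the line
breaking class, so this sketch does not show breakability. The list
SPACES is my own selection.

```python
import unicodedata

# The code points discussed in this message.
SPACES = ["\u00a0", "\u2005", "\u2009", "\u200b", "\u202f", "\u2060"]


def describe(ch: str) -> str:
    """Return 'U+XXXX  category  NAME' for a single character."""
    return f"U+{ord(ch):04X}  {unicodedata.category(ch)}  {unicodedata.name(ch)}"


if __name__ == "__main__":
    for ch in SPACES:
        print(describe(ch))
```

The category column separates the space characters proper (Zs) from the
format controls (Cf), which is the distinction raised earlier about how
an unbreakable thin space would be classified.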

-- 
Jukka "Yucca" Korpela, http://www.cs.tut.fi/~jkorpela/


This archive was generated by hypermail 2.1.5 : Mon Sep 25 2006 - 01:40:20 CST