Re: Question about formatting numerals

From: Philippe Verdy (verdy_p@wanadoo.fr)
Date: Sun Sep 24 2006 - 23:44:36 CST


    I can reply with the conventions used in France by almost all newspapers and magazines and by most book and guide publishers. The "space" to use is called a "fine", and it is part of the preprint composition process. In ads, it is not counted as a significant character (one that customers pay for when their inserts are billed by the number of "lines", i.e. an arbitrary fixed number of "characters"), unlike the normal space that separates words.

    In preprint composition, this space is normally thinner than a normal space, and unbreakable. Preprint editors or processors compute the position of these unbreakable thin spaces and use some higher-level markup to specify it; the "fine" is not treated like other characters or spaces. For example, some of these processes or editors allow inserting it as "[FI]" (a markup notation), sometimes shown in a highlighting color.

    It is mandatory in French typography before colons, semicolons, question marks and exclamation marks, and inside guillemets. It is also the standard thousands separator, the separator between a numeric quantity and a unit (including currencies), and the separator in telephone numbers or other specially formatted numbers (such as identity numbers).

    If the "unbreakable thin space" were encoded exactly in Unicode, it would not be encoded as a "graphic" character, not even as a character with the "space" general category, but really as a *formatting control*. There is no exact match in Unicode for this markup or rich-text composition device. However, there is a long-standing practice, dating from the time of typewriters, of substituting a space for it (while ensuring that no line break occurs before or after it).

    So the nearest match, if one wants to represent it in Unicode without using any markup, remains NBSP (U+00A0 NO-BREAK SPACE). It is then the job of the preprint processor to transform NBSP contextually (when it is encoded next to digits or punctuation marks) into "fines". Using normal spaces (U+0020) leads to errors of interpretation or rendering.
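As a rough illustration of such a contextual transformation, here is a minimal Python sketch. It assumes that U+202F NARROW NO-BREAK SPACE stands in for the "fine" on output, and the function name `nbsp_to_fine` and the list of contexts are hypothetical; a real preprint processor would emit its own markup and apply its own, more complete rules.

```python
import re

NBSP = "\u00a0"   # U+00A0 NO-BREAK SPACE, as typed in the plain text
FINE = "\u202f"   # assumption: U+202F NARROW NO-BREAK SPACE stands in for the "fine"

# Contexts taken from the French rules described above (not exhaustive).
_FINE_CONTEXT = re.compile(
    r"(?<=\d)\u00a0"              # after a digit: digit groups, quantity + unit
    r"|\u00a0(?=[:;?!\u00bb])"    # before : ; ? ! and before a closing guillemet
    r"|(?<=\u00ab)\u00a0"         # after an opening guillemet
)

def nbsp_to_fine(text: str, fine: str = FINE) -> str:
    """Replace NBSP by `fine`, but only in the contexts listed above."""
    return _FINE_CONTEXT.sub(fine, text)

sample = "Prix\u00a0: 27\u00a0312\u00a0\u20ac, M.\u00a0Dupont"
print(ascii(nbsp_to_fine(sample)))
```

In this sample the NBSP after "M." is left alone, because it is neither next to a digit nor before one of the listed punctuation marks; only the other three become "fines".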

    A more exact representation in plain text, which could be used within the preprint processor, would be U+2060 WORD JOINER (the newer recommended zero-width non-breaking control) before and after one of the thin spaces listed below (for example U+2009 THIN SPACE). Note that U+2009 has the wrong properties by itself because it is breakable, unless you use a line breaker compatible with the recommended line-breaking technical specification. The alternative would be to generate U+202F NARROW NO-BREAK SPACE.
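The two plain-text options above can be sketched in Python as follows. This is only an illustration: the helper name `group_digits` is hypothetical, it expects an unsigned decimal string with "." as the decimal point, and it groups the fractional digits from the decimal point outward, as in the "27 312 416.315 67" example of the original question.

```python
WORD_JOINER = "\u2060"   # U+2060 WORD JOINER
THIN_SPACE = "\u2009"    # U+2009 THIN SPACE (breakable on its own)
NNBSP = "\u202f"         # U+202F NARROW NO-BREAK SPACE

# Option 1: a thin space glued on both sides by word joiners.
GLUED_THIN_SPACE = WORD_JOINER + THIN_SPACE + WORD_JOINER

def group_digits(number: str, sep: str = NNBSP) -> str:
    """Group an unsigned decimal string such as '27312416.31567' in threes."""
    integer, dot, fraction = number.partition(".")
    # Integer part: groups are counted from the right.
    head = len(integer) % 3
    int_groups = ([integer[:head]] if head else []) + [
        integer[i:i + 3] for i in range(head, len(integer), 3)
    ]
    # Fractional part: groups are counted from the left.
    frac_groups = [fraction[i:i + 3] for i in range(0, len(fraction), 3)]
    return sep.join(int_groups) + dot + sep.join(frac_groups)

# Option 2 (single character) and option 1 (three code points per gap).
print(ascii(group_digits("27312416.31567")))
print(ascii(group_digits("27312416.31567", GLUED_THIN_SPACE)))
```

Option 1 costs three code points per gap but keeps the thin-space width; option 2 is a single character whose width depends on the font.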

    But I think none of the characters listed is recommended for general-purpose text input; they should be considered only at the final fine-typography composition stage, just before rendering. If a higher-level protocol is available, it is still best to use markup to specify the position of these "fines" rather than encoding any space character. Using them in data interchange seems to be a bad idea.

    Note that English and French typography recommend different widths for rendering this control: French typography recommends a wider advance than English typography. The French "fine" is roughly 1/4 em, the English one roughly 1/6 em, and the word-spacing "space" roughly 1/2 em. So one could say that a "fine" is a half-space in French, approximated by a normal space when the "fine" is not directly available, but a third-space in English, approximated by adding no space at all when the "fine" is not directly available.

    For simplicity in plain text, it is still best to use NBSP, because it is available in most fonts and mapped in most legacy 8-bit encodings, including those based on or close to ISO 8859. Then let the renderer or preprint composition process transform it contextually into a formatting control, which is what it should be. (Note that in typography the "fine" never occurs at the beginning or end of a line; it is always used to adjust the kerning of a pair of "graphic characters", i.e. glyphs in Unicode terminology. It carries no semantics by itself, as it only enhances the presentation to help readers.)
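The point about legacy 8-bit encodings can be checked with a small Python sketch. The helper `encodable` is a hypothetical name, and the three encodings are only a sample chosen for illustration, not a survey of every legacy charset.

```python
def encodable(ch: str, encoding: str) -> bool:
    """Return True if `ch` can be represented in the given encoding."""
    try:
        ch.encode(encoding)
        return True
    except UnicodeEncodeError:
        return False

CANDIDATES = {
    "U+00A0 NO-BREAK SPACE": "\u00a0",
    "U+2009 THIN SPACE": "\u2009",
    "U+202F NARROW NO-BREAK SPACE": "\u202f",
}

# Sample of legacy 8-bit encodings (assumption: representative, not complete).
ENCODINGS = ("iso-8859-1", "iso-8859-15", "cp1252")

for name, ch in CANDIDATES.items():
    support = {enc: encodable(ch, enc) for enc in ENCODINGS}
    print(name, support)
```

In these three encodings only NBSP survives (as byte 0xA0); the thinner spaces cannot be encoded at all, which is one practical reason to keep NBSP in the interchanged text and leave the choice of a "fine" to the renderer.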

    ----- Original Message -----
    From: "Guy Steele" <Guy.Steele@sun.com>
    To: <unicode@unicode.org>
    Sent: Wednesday, September 20, 2006 6:06 PM
    Subject: Question about formatting numerals

    > When numerals are to be formatted in formal scientific texts
    > according to the custom of using space to separate the digits
    > in to groups of three, as in "27 312 416.315 67 m/s",
    > what is the recommended Unicode character to use for
    > this separation? Obvious candidates are
    >
    > U+2006 SIX-PER-EM SPACE
    >
    > U+2008 PUNCTUATION SPACE
    > (because then the gap would be equal to the gap caused
    > by the decimal point?)
    >
    > U+2009 THIN SPACE
    >
    > U+200A HAIR SPACE
    >
    > U+202F NARROW NON-BREAKING SPACE
    > (because non-breaking is desirable in running text)
    >
    > What is current practice? What is recommended by Unicode savants?


