L2/02-357
From: Eric Muller
Date: 2002-10-25 18:10:13 -0700
Subject: Re: Dashes

I made the original proposal for new dashes under the assumption that the various dashes are defined in Unicode by their width; that's what the first part of the proposal explains. Under that assumption, I was really after the 3/4 em dash, and included the other widths for completeness.

The discussion during the UTC meeting revealed that my assumption was wrong. I was given the action item 92-A11: "Get contrasting examples to show em dashes, 3/4 em dashes and 1/3 em dashes." As some of you predicted, I did not find any convincing example (other than the obvious "here are the various widths of dashes that are in use:..."). Lisa and Cathy, you can take this as a resolution of the AI: "done; there aren't any".

Given that, my current take on 2em and 3em dashes, and on the public issue in particular, is:

Font technologies, at least Type1/CFF, make it hard to ensure that two separate glyphs connect under all rendering circumstances. In Type1, a glyph carries both dimensional information and structural information (e.g. these two stems ought to have the same rendered width; aka hints). The connection property can only be described as dimensional information. The process of rasterizing cannot always satisfy both descriptions at the same time, in particular at low resolution. In those cases, the structural information takes precedence. In the end, no matter how we do it, under current technology, there will be some point size on some device where there will be a white pixel between the two dashes. I am not familiar enough with the TrueType technology to say whether the same applies. This contrasts with metal type, where the connection can be guaranteed (in part by relying on ink bleeding).

This can be handled by ligatures, but such machinery is not deployed widely enough to depend on it in this case. Saying "you can use two consecutive EM DASH characters to encode a 2em dash, and rendering systems are encouraged to render them as connected, but you are not guaranteed to obtain that effect" would be OK, but saying "all renderers must connect two consecutive dashes" is too much.

Typographically, 2em and 3em dashes can be handled by rules very easily. In fact, I'd say that this is how many typographers see them. They are more similar to the leaders used, e.g., in tables of contents; they just happen to be small.

In the end, I don't feel a strong need to express 2em and 3em dashes as characters, either on their own or as compositions of existing dashes; and I don't think we can achieve a reliable rendering by composition.

This discussion also touched on the use of dashes for quotations, e.g. in French. Two comments:

U+2015 HORIZONTAL BAR (= QUOTATION DASH) is for that usage, correct? Whether a particular style chooses to make quotation dashes an em wide or anything else, with or without space around them, it is still the case that the "proper" encoding of documents uses U+2015 in that case.

I like them! That's how I was raised 8-)

Now, I am still left with my original problem. My new assumption is the following: the dashes are defined by their use. This actually solves rather nicely the problem of the various styles in use, which the "defined by width" approach does not handle well. Let me give a little bit of context: the kind of workflow I am interested in is one where you have documents with characters and markup but no style on the one hand, and stylesheets on the other; to be concrete, DocBook with XSLT stylesheets is a good archetype. What's interesting about this model is that it accounts well for "network publishing", i.e. the same material (a DocBook document) presented in multiple ways (change the stylesheets and their targets), and it can also explain the more traditional WYSIWYG approach, by saying that the user manipulates the document and the stylesheet simultaneously.

In that world, defining the dash characters by their width is problematic. The decision to set off a phrase using non-spaced em dashes or using spaced en dashes really belongs to the stylesheet, and it is desirable to have the same content in the document, regardless of the style(s) by which this content is going to be rendered. This is much easier to achieve if we declare that U+2014 EM DASH is the character used to set off a phrase, and that it can be rendered by a spaced en dash. The only alternative I see is to carry the "set off a phrase" bit by markup instead, but that seems a bit heavy-handed.
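To make the split between content and style concrete, here is a minimal sketch in Python. It is only an illustration: the style names and the function render_phrase_dashes are hypothetical, and text substitution stands in for what a real stylesheet or renderer would do at the glyph level. The document always carries U+2014 for "set off a phrase"; the style alone decides what appears on the page.

```python
import re

EM_DASH = "\u2014"  # in the document: "set off a phrase"
EN_DASH = "\u2013"

# Hypothetical style table (names are assumptions, for illustration only):
# each style says how the phrase-setting dash is presented.
STYLES = {
    "unspaced-em": EM_DASH,
    "spaced-en": " " + EN_DASH + " ",
}

# A U+2014 together with any plain spaces the author typed around it.
PHRASE_DASH = re.compile(" *" + EM_DASH + " *")


def render_phrase_dashes(text: str, style: str) -> str:
    """Return text with every phrase-setting dash presented per the style."""
    presentation = STYLES[style]
    # A function is used as the replacement so that nothing in the
    # presentation string is interpreted as a regex escape.
    return PHRASE_DASH.sub(lambda match: presentation, text)


content = "The proposal\u2014as revised\u2014stands."
print(render_phrase_dashes(content, "unspaced-em"))
print(render_phrase_dashes(content, "spaced-en"))
```

The same content string goes through both styles unchanged; only the presentation differs, which is the property I am after.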

In the end, my new quest is to get U+2014 EM DASH and U+2013 EN DASH understood literally as they are described in section 6.1, by their use, and to essentially ignore the EM and EN in their names (much like we all know to replace LEFT by OPENING in U+0028 LEFT PARENTHESIS). Together with U+2015 HORIZONTAL BAR (understood as a quotation dash, as used in French) and U+2012 FIGURE DASH, I believe we have covered all the important functions. It may be worth crafting additional words for 6.1, to say that those characters can be rendered by glyphs that are not 1em or 1en wide, with more or less space. I'll be happy to propose some words to that effect if we like this approach.
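As a small illustration of "name versus function", the following Python sketch lists the four characters with their formal names (taken from the standard library's unicodedata module) next to a one-line description of their use. The descriptions in DASH_USE are my own paraphrase, not normative text, and the helper describe is hypothetical.

```python
import unicodedata

# Use descriptions are an informal paraphrase (an assumption of this
# sketch), not wording taken from the standard.
DASH_USE = {
    "\u2012": "dash set among digits",
    "\u2013": "ranges of values",
    "\u2014": "setting off a phrase",
    "\u2015": "introducing quoted speech",
}


def describe(ch: str) -> str:
    """Pair a character's formal name with the use it is meant for."""
    return f"U+{ord(ch):04X} {unicodedata.name(ch)}: {DASH_USE[ch]}"


for ch in DASH_USE:
    print(describe(ch))
```

Nothing in the right-hand column mentions a width, which is the point: the names are labels, and the use is what the document encodes.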

Thanks to Ken for opposing my new dash proposal (at least for 3/4em and 1/3em); first because he is right, second because this is not what I need.

Eric.