Re: When do you use U+2024 ONE DOT LEADER instead of U+002E FULL STOP?

From: Kenneth Whistler (kenw@sybase.com)
Date: Fri May 30 2003 - 21:53:02 EDT

    Philippe Verdy continued:

    > What surprises me the most in the Unicode spec is that it
    > both says that its purpose is to create arbitrary length
    > of leaders

    As in plain text, as can be seen in Table of Contents listings
    in many RFCs, for example. (Which, however, use ASCII 0x2E for the
    same purpose.)

    > (you say that the spacing statement in the Xerox name was
    > not considered important by Xerox, so how many leaders would
    > be needed to fit an en space with the Unicode designation?).

    If you mean how many leader *dots* would it take to fit an en
    space, that would depend on the font in Unicode, as for so
    much else. My guess would be that the correct answer is
    approximately the same as the number of angels that can stand
    on the dot.

    Very few characters in Unicode have any specified widths. That
    is by design.

    > Why then do you insist that it represents one dot ?

    Because that was the intent of the Unicode Technical Committee
    when it encoded the character, and is the clear intent of the
    standard as currently specified.

    > You also seem to insist on the "compatibility" decomposition
    > which is normally removing an important semantic (else it
    > would be canonical).

    I'm simply restating the specification in the standard. Read it
    yourself.
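
    The specification referred to here can be inspected directly in the
    character data. The following is a minimal sketch, assuming Python's
    standard unicodedata module (which ships a copy of the Unicode
    Character Database); the helper name describe() is hypothetical.
    It prints the name, General Category, and decomposition field for
    the three leader/ellipsis characters discussed in this thread.

```python
import unicodedata

# U+2024 ONE DOT LEADER, U+2025 TWO DOT LEADER, U+2026 HORIZONTAL ELLIPSIS
LEADERS = ["\u2024", "\u2025", "\u2026"]


def describe(ch):
    """Hypothetical helper: (name, General Category, raw decomposition field)."""
    return (
        unicodedata.name(ch),
        unicodedata.category(ch),
        unicodedata.decomposition(ch),
    )


for ch in LEADERS:
    name, category, decomposition = describe(ch)
    print(f"U+{ord(ch):04X} {name}: gc={category}, decomposition={decomposition!r}")

# Compatibility normalization (NFKC) applies the <compat> mappings,
# so each leader character becomes a run of U+002E FULL STOP.
# Canonical normalization (NFC) leaves the leader characters alone.
sample = "Intro \u2024\u2024\u2024 1"
print(unicodedata.normalize("NFKC", sample))
print(unicodedata.normalize("NFC", sample) == sample)
```

    The decomposition field for U+2024 carries the <compat> tag and a
    single U+002E, which is the "one dot" reading restated above.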

    > All this seems like creating contradictions.
    >
    > Also it would be the only punctuation sign whose number of
    > occurrences is not relevant

    False. See the discussion of Tibetan justifying tseks in:

    http://www.unicode.org/versions/Unicode4.0.0/ch09.pdf

    > (in dotted lines used as leaders),

    Or, for that matter, in plain text visual line separations
    also created by stringing together ASCII punctuation:
    **********************************************************
    like that. Such legacy use of punctuation characters is no
    different than legacy use of a sequence of periods to create
    leader lines in plain text.

    > as the final presentation of the text will need to compensate
    > for font metrics differences in order to produce the correct
    > effect (also because the size of the dots were removed from
    > the Unicode designation.)

    So? That is irrelevant to the question at hand. People who do
    stuff like this, as in plain text RFCs, display text in
    monospace fonts and don't expect dynamic reflowing of text.

    People who do leader lines correctly for fine typography do
    them with internal data abstractions, and those data abstractions
    aren't based on interpreting U+2024 as a format control character.

    > I do not agree with your argument that says that it is like a
    > full dot to be used in limited applications

    You can disagree with my argument all you like. But if you insist
    on coming on the unicode list and spouting nonsense about
    particular characters in the standard, suggesting that people
    implement them in ways that would be nonconformant with the
    standard, then expect people to respond to the nonsense.

    > (if Unicode wanted to remove the spacing, it was to generalize
    > its use as an abstract character, not to reinforce its mapping
    > to an approximate full dot.)

    That claim is arrant nonsense.

    > I never heard about the Xerox CCS before, but there's a large
    > legacy usage of the ellipsis as a single unbreakable character

    Correct. And U+2026 is encoded precisely for that legacy practice.

    > (and the two dots for the notation of interval bounds are also
    > unbreakable).

    True, but this kind of behavior falls automatically out of most
    implementations' treatment of U+002E characters in sequence.
    Check UAX #14, which discusses the line break behavior of both
    the leader dot characters and U+002E FULL STOP. U+002E is lb class
    IS, and since class IS prohibits a break before, a sequence of
    two periods in a row, as in [0..1] does not have a break
    opportunity in the middle of the sequence.
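
    The class IS behavior described above can be illustrated with a toy
    check. This is only a sketch under stated assumptions: the class
    table is a hand-copied one-entry excerpt (a real implementation
    would load the full line break property data from UAX #14), the
    function names are hypothetical, and the code models just the
    single "no break before IS" rule, not the whole algorithm.

```python
# Hand-copied excerpt: only the class stated in the message above.
LINE_BREAK_CLASS = {
    "\u002E": "IS",  # FULL STOP
}

# Classes before which a break is prohibited (this sketch models IS only).
NO_BREAK_BEFORE = {"IS"}


def prohibited_by_is_rule(before, after):
    """Hypothetical helper: True if a break between `before` and `after`
    is ruled out because `after` has line break class IS.

    The `before` character is ignored; the rule looks only at what follows.
    """
    return LINE_BREAK_CLASS.get(after) in NO_BREAK_BEFORE


def prohibited_positions(text):
    """Indexes i where a break between text[i-1] and text[i] is prohibited
    by the modeled rule. Other positions are simply not decided here."""
    return [
        i
        for i in range(1, len(text))
        if prohibited_by_is_rule(text[i - 1], text[i])
    ]


print(prohibited_positions("[0..1]"))
```

    For "[0..1]" the modeled rule blocks the position before each
    period, including the one between the two periods.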

    > The single dot leader looks like a way to fill the gap,
    > only because two-dot three-dots ellipsis did not allow,
    > in most fonts and applications, to create a regular leader,
    > using smaller dots than the one used for the regular full stop
    > punctuation.

    You are mixing up glyphs and characters here.

    In "most fonts and applications" leader dots are *glyphs* used
    to express a measured leader line, not characters at all.

    > The fact that it was unified with XCCS (with some
    > compromises accepted by Xerox) clearly demonstrates that
    > the Xerox design was not the main focus:

    In the case of encoding of the ONE DOT LEADER, you don't know what you
    are talking about.

    > - Who knows XCCS and uses it ? Very few people.

    Today, yes. But it was a key source of character repertoire for
    Unicode 1.0, and choices made in the XCCS often guided thinking
    about character/glyph distinctions for Unicode.

    > - Who uses leaders ? Every publisher and author of long documents
    > that do not want to see irregularly spaced leaders, or a dotted
    > grid instead of a true dotted horizontal line.

    This is irrelevant to the claims you have been making about U+2024.

    >
    > Leaders are visual helpers for the eye of readers, they have
    > absolutely no punctuation or symbolic semantic (unlike the
    > two-dots symbol or the ellipsis). The fact that it was categorized
    > as a punctuation is probably an initial error

    It was not. The error is your assumption that the TWO DOT LEADER
    was encoded to represent the convention of using <U+002E, U+002E>
    to indicate a range.

    > that can't be corrected and that comes from the classification
    > of its approximate fallback "compatibility decomposition".
    >

    > So you seem to mix the very distinct concept of compatibility
    > characters and compatibility decompositions:

    I see...

    [*looks around the office to see who else it was who wrote that
      text in Chapter 2*]

    ...but I do appreciate the coals delivered to Newcastle. ;-)

    > - compatibility characters are for the initial mapping from an
    > important legacy encoding with full roundtrip, and the
    > exact semantic is preserved in this mapping to Unicode. The usage
    > of these Unicode codepoints is discouraged out of this legacy usage.
    >
    > - characters that have compatibility decompositions are intended
    > as guides for acceptable fallback characters that will not create
    > too confusive interpretation by readers, but the exact semantic
    > is not preserved with their compatibility decomposition. Their
    > usage is not discouraged but instead favored by Unicode which
    > adds important semantics in the "composed" character.

    I won't deconstruct this sentence by sentence. But use of
    compatibility characters is not discouraged. *Some* of them
    are deprecated; *some* of them are inappropriate for particular
    uses; *some* of them are, in fact, required for other contexts.
    It depends on what you are doing in your implementations.

    Compatibility decompositions were *not* defined as
    guides for acceptable fallback. They can be used as part of
    a fallback conversion implementation, but fallback is a much
    more general problem, and applies to characters that have no
    decompositions and to characters with canonical decompositions,
    as well.

    Finally, some compatibility decomposable characters are not
    only discouraged, they may even be "strongly discouraged", for
    one reason or another. See, for example, U+0F77 and U+0F79.
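
    The three cases mentioned above (no decomposition, canonical
    decomposition, compatibility decomposition) can be told apart from
    the decomposition field alone. The following is a minimal sketch,
    assuming Python's standard unicodedata module; decomposition_kind()
    is a hypothetical helper, and the sample characters other than
    U+2024, U+0F77 and U+0F79 are chosen here only for illustration.

```python
import unicodedata


def decomposition_kind(ch):
    """Hypothetical helper: classify a character's decomposition mapping.

    In the character database, a compatibility mapping is marked with a
    leading <tag>; an untagged mapping is canonical; an empty field
    means the character has no decomposition at all.
    """
    field = unicodedata.decomposition(ch)
    if not field:
        return "none"
    if field.startswith("<"):
        return "compatibility"
    return "canonical"


SAMPLES = [
    "\u2024",  # ONE DOT LEADER
    "\u00E9",  # LATIN SMALL LETTER E WITH ACUTE (illustrative choice)
    "\u00F8",  # LATIN SMALL LETTER O WITH STROKE (illustrative choice)
    "\u0F77",  # TIBETAN VOWEL SIGN VOCALIC RR
    "\u0F79",  # TIBETAN VOWEL SIGN VOCALIC LL
]

for ch in SAMPLES:
    print(f"U+{ord(ch):04X} {unicodedata.name(ch)}: {decomposition_kind(ch)}")
```

    A fallback converter has to handle all three kinds, so the
    compatibility mappings alone do not define it.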

    I'd advise more care in making unjustified generalizations
    and then proclaiming them to the unicode list as if they
    were expert opinions.

    --Ken


