Re: U+0140

From: Kenneth Whistler
Date: Mon Apr 19 2004 - 16:03:07 EDT


    John Hudson responded to Michael Everson:

    > Michael Everson wrote:
    > >> This would make the mid-dot too high. The top dot of the colon usually
    > >> sits toward the top of the x-height; the *mid*-dot should sit lower,

    > > John, I just don't believe you. I don't believe that in all the history
    > > of Greek and Catalan typography this careful hairsplitting has *always*
    > > taken place; certainly in scientific transcription the HALF TRIANGULAR
    > > COLON is just the top dot in the TRIANGULAR COLON, and in Americanist
    > > transcription where the dot-colons are used instead of triangles I would
    > > say the same applies.
    > I never contested that the dots of a colon correspond to the triangles of the linguistic
    > long vowel marker. They clearly do. What I contested was that the typographic mid-point
    > (U+00B7) corresponded to the top dot of a colon. It clearly does not. It is called a
    > mid-point because it sits midway up the x-height. It is used in this position for a
    > variety of stylistic purposes, ...

    I think we have two typographers here arguing somewhat at cross-purposes.
    Clearly the typographic "mid-point" behaves as John has mentioned, and is
    designed as such in many fine fonts (examples seen among the exhibits that
    Asmus gathered).

    But just as clearly, there is a long, long tradition in Americanist
    orthographic practice (which is used widely for linguistic orthographies
    outside of Native America as well) of using a "raised dot" for an indication
    of vocalic (and occasionally consonantal) length. For 100 years, that
    raised dot was mechanically generated by, among other means, filing the
    lower dot off a colon key on a mechanical typewriter. (I have such a
    typewriter sitting on my desk.) Linguists got used to this raised dot
    height, coordinated with a colon in design (which then could be used,
    among other things, to indicate prolonged length when two degrees of
    length were in question), and that preference made its way into print, at least
    for many North American languages, where the raised dot could be printed
    at x-height, rather than midway up the x-height, which would be too
    low for most of the linguistic usage.

    Enter the electronic age. ASCII had no MIDDLE DOT. It was period (.), colon (:)
    or the highway. Early linguistic material on computers made do with those,
    because they had no choice. The IBM PC and the Macintosh introduced a
    MIDDLE DOT (0xFA [= IBM CDRA SD630000 "Middle Dot"] and 0xE1, respectively).
    When ISO 8859-1 was defined, it also had a MIDDLE DOT (0xB7). *Everybody*
    made use of that MIDDLE DOT for anything that was vaguely in the ballpark --
    the typographical mid-point, the linguistic length mark, the mathematical
    multiplication operator, the Greek ano teleia, the dictionary hyphenation
    point, and, yes, the Catalan middle dot. The fact that each of those usages
    might have extremely fine typographical hairs to split regarding the rendering
    was so much horsepucky as far as the character identity was concerned. You
    used what you had available to represent your data.
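    The legacy mappings above can be checked directly, since all three of those
    code points map to the same Unicode character, U+00B7. A minimal sketch using
    Python's bundled codecs (the byte values are the ones cited above; the codec
    names `cp437`, `mac_roman`, and `latin_1` are Python's names for those
    character sets):

```python
# Each legacy encoding's middle-dot byte decodes to the same code point, U+00B7.
legacy = {
    "cp437": 0xFA,      # IBM PC code page 437 middle dot
    "mac_roman": 0xE1,  # original Macintosh character set middle dot
    "latin_1": 0xB7,    # ISO 8859-1 MIDDLE DOT
}

for codec, byte in legacy.items():
    ch = bytes([byte]).decode(codec)
    print(f"{codec}: 0x{byte:02X} -> U+{ord(ch):04X}")
```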

    The Unicode Standard, for a variety of reasons -- some of which included
    compatibility mapping concerns to other standards which had started to
    proliferate middle dots -- added a collection of middle dots *besides*
    U+00B7, *the* middle dot, to its repertoire. Those other middle dots give
    people textual representation alternatives now, if they need to make
    distinctions, and textual rendering alternatives, if they need to make
    middle dots which display with slightly different heights, sizes, or
    spacings, depending on the rendering requirements.
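    As a concrete illustration of those alternatives, the middle-dot-like code
    points mentioned in this thread can be inspected with Python's `unicodedata`
    module. A sketch only; the pairing of each usage with a code point follows
    the thread's own list, not any normative mapping:

```python
import unicodedata

# Middle-dot-like code points discussed in this thread; the usage labels
# reflect the discussion above, not normative definitions.
dots = {
    "\u00B7": "typographic mid-point / Catalan middle dot",
    "\u02D0": "length mark (MODIFIER LETTER TRIANGULAR COLON)",
    "\u02D1": "half-length mark (MODIFIER LETTER HALF TRIANGULAR COLON)",
    "\u0387": "Greek ano teleia",
    "\u2027": "dictionary hyphenation point",
    "\u22C5": "mathematical multiplication (DOT OPERATOR)",
}

for ch, usage in dots.items():
    print(f"U+{ord(ch):04X} {unicodedata.name(ch)} "
          f"[{unicodedata.category(ch)}]  # {usage}")

# U+0387 canonically decomposes to U+00B7, so normalization folds the
# Greek ano teleia back into the plain MIDDLE DOT.
assert unicodedata.normalize("NFC", "\u0387") == "\u00B7"

# The thread's subject, U+0140 LATIN SMALL LETTER L WITH MIDDLE DOT,
# compatibility-decomposes to "l" followed by U+00B7.
print(unicodedata.decomposition("\u0140"))  # -> "<compat> 006C 00B7"
```

    Note that the general categories differ (punctuation, modifier letter,
    math symbol), which is one concrete face of the character-property
    difficulties discussed below.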

    What is clear, however, is that it is utterly impossible to satisfy
    everybody regarding middle dots. Typographical purists will always want
    plain text to make more distinctions. Text processing requirements will
    abhor the splitting of text representation into more and more difficult-to-
    distinguish glyph representations without clear semantic differences.
    And dot proliferation *always* poses difficulty for establishing
    character properties.

    Before people bluster on too much further on this thread, it would
    be good for everyone to recall that the *reason* why U+00B7 has
    problematical properties is that it was inherently ambiguous in
    *preexisting* usage (that is, prior to Unicode altogether) as punctuation
    versus length mark (and other things as well). This puts it in the
    same grabbag of very difficult, ambiguous ASCII characters, such as
    "~", "*", and "'" which also acquired conflicting usages during their
    reign among the small set of available punctuation and symbols in

    History has consequences. The history of a character's encoding also
    has consequences for how the Unicode Standard is to be used and
    interpreted.


    This archive was generated by hypermail 2.1.5 : Mon Apr 19 2004 - 16:56:36 EDT