Re: apostrophes

From: Steve Summit
Date: Sat May 20 2006 - 10:58:31 CDT

    Thanks for your thoughtful reply, Jukka. I hadn't fully thought
    about the lexical class of U+02BC. This is clearly the crux of
    the matter. But if we think about it carefully, I'm not at all
    sure what we mean when we talk about the character which, as you
    say, "we commonly regard as (punctuation) apostrophe".

    I'm beginning to think that this idea of a "punctuation
    apostrophe" is a notional one that exists only in our heads, a
    vestige of the bad old days when U+0027 tried to do everything
    and clearly had to be regarded as a punctuation character.
    Now that we've separated out U+2018 and U+2019 for the quotation
    mark uses, and U+2032 for prime, I'm really not sure we ever have
    to think about the plain apostrophe as a "punctuation" character
    any more!
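
    For reference, the lexical classes at issue can be inspected
    directly. What follows is only a minimal sketch, assuming
    Python 3 and its standard unicodedata module (neither is
    mentioned in the original message); describe() is a hypothetical
    helper name chosen for illustration.

```python
import unicodedata

# The characters named so far in this thread.
CANDIDATES = ["\u0027", "\u02BC", "\u2018", "\u2019", "\u2032"]

def describe(ch):
    """Return (code point, character name, general category) for ch."""
    return (f"U+{ord(ch):04X}", unicodedata.name(ch), unicodedata.category(ch))

for ch in CANDIDATES:
    # Categories starting with "L" are letters, "P" punctuation.
    print(*describe(ch))
```

    In the Unicode character database, U+02BC has general category
    Lm (modifier letter), while U+0027 and U+2019 are in punctuation
    categories (Po and Pf respectively).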

    In particular, let's look at the Unicode Standard's current text
    on the matter, which says in section 6.2 ("General Punctuation",
    page 159 in version 4.0.0):

            Letter Apostrophe. U+02BC MODIFIER LETTER APOSTROPHE is
            preferred where the apostrophe is to represent a modifier
            letter (for example, in transliterations to indicate a
            glottal stop). In the latter case, it is also referred
            to as a letter apostrophe.

            Punctuation Apostrophe. U+2019 RIGHT SINGLE QUOTATION
            MARK is preferred where the character is to represent a
            punctuation mark, as for contractions: "We've been here
            before." In this latter case, U+2019 is also referred to
            as a punctuation apostrophe.

    Now, the key question is: even in English, in what sense is the
    apostrophe in the word "we've" actually a punctuation character?

    If you're using a graphical, mouse-based environment, move the
    mouse cursor anywhere over the word "we've" and double click.
    You get the whole word, as clearly you should. Run the word
    through your spellchecker. It does not report "ve" as a separate
    word which is misspelled, as clearly you would not want it to.
    In both of these cases, the apostrophe (the plain old ASCII
    apostrophe) is being treated as part of the word. I've written
    word-matching code any number of times, and I tend to treat the
    apostrophe as a letter for this purpose.
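
    The word-matching approach described above might look roughly
    like the following. This is an illustrative sketch in Python
    (an assumption; the original does not say what language the
    author's code was in), and the pattern and function names are
    hypothetical.

```python
import re

# Letters (roughly) as the regex engine classifies them: word
# characters minus decimal digits and underscore.
LETTERS_ONLY = re.compile(r"[^\W\d_]+")

# The same, but with the ASCII apostrophe admitted as one more letter.
WITH_APOSTROPHE = re.compile(r"(?:[^\W\d_]|')+")

def words(text, pattern):
    """Return every run of characters that pattern accepts as a word."""
    return pattern.findall(text)

sample = "We've been here before."
print(words(sample, LETTERS_ONLY))      # "We" and "ve" come out separately
print(words(sample, WITH_APOSTROPHE))   # "We've" stays in one piece

# U+02BC is already a letter (category Lm), so no special case is needed.
print(words("We\u02BCve been here before.", LETTERS_ONLY))
```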

    There are complications, of course. If you put the word 'we've'
    in single quotes and double click, you tend to get just the word,
    not the quotes. And of course this is because the software is
    using some fancier heuristics, treating an embedded apostrophe,
    with alphabetics on both sides, differently. But then you get
    aberrations with possessives, as can be seen when comparing
    "my parents' house" and "my sister's house". *BUT*, and this
    is the key point, if the apostrophe character were encoded using
    a different code point than the single quote character, the
    software wouldn't have to resort to those ambiguous heuristics.
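
    The "embedded apostrophe" heuristic, and the way it stumbles on
    plural possessives, can be sketched as below. This is a guess
    at what such software does, not a description of any particular
    product; Python and all names here are assumptions made for
    illustration.

```python
import re

# Heuristic: an ASCII apostrophe is part of a word only when it has
# letters on both sides.
EMBEDDED = re.compile(r"[^\W\d_]+(?:'[^\W\d_]+)*")

# No heuristic: just runs of letters. U+02BC qualifies as a letter.
LETTERS = re.compile(r"[^\W\d_]+")

def words(text, pattern):
    """Return the words that pattern finds in text."""
    return pattern.findall(text)

# Quoted word: the heuristic keeps the word and drops the quotes.
print(words("'we've'", EMBEDDED))

# Possessives: "sister's" survives, but "parents'" loses its apostrophe.
print(words("my sister's house", EMBEDDED))
print(words("my parents' house", EMBEDDED))

# With U+02BC as the apostrophe and U+0027 left for quoting, plain
# letter matching gets both cases right without guessing.
print(words("'my parents\u02BC house'", LETTERS))
```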

    In English, I think the only sense in which the apostrophe is not
    a "letter" is that we don't list it along with the other 26 when
    we say our ABC's, and we don't treat it as significant in
    alphabetization. But, really, surprising as it seems, I think
    it's much more like a letter than punctuation; and furthermore,
    this is as true in English as in other languages which explicitly
    call their apostrophe a "modifier letter". I don't think there's
    much useful distinction to be made -- syntactic, semantic, or
    otherwise -- between the contractive apostrophe in the English
    "we've" and, say, the apostrophe-as-glottal-stop in languages
    that use it that way. (And, certainly, any such distinction is
    considerably *less* significant than the distinction between
    apostrophe and closing single quote!)


    >> (Me, I'd really like to distinguish apostrophes from quotes in
    >> textual data, as they're obviously quite different semantically.)
    > Many people have expressed the same view. It would have meant that a
    > new character would have been defined, for unambiguous use as
    > punctuation apostrophe.

    Well, yes, but only if you think that the "punctuation
    apostrophe" is punctuation! Back when the first Unicode
    Standard came out, I really did think that not one but several
    "new characters had been defined", and that they had been defined
    precisely so that I could start making useful distinctions,
    e.g. between U+02BC for the true apostrophe and U+2018 and U+2019
    for the quotes. It seemed odd that U+02BC was a "modifier
    letter" and not the punctuation I still thought it was, but I was
    prepared to overlook this, because the code point was
    distinct, and the glyph appearance was correct, and everything
    would have worked out fine. (But then Unicode 3.0 went and took
    away the useful, liberating new distinction I thought I'd been
    granted, and I was crushed.)

    > I don't think traditional or modern typography ever distinguishes
    > between a punctuation apostrophe and a right single quotation
    > mark...

    Certainly not, which is why we now have the curious situation
    that a character whose name is "Right Single Quotation Mark"
    can carry a recommendation saying that it is also "the preferred
    character for apostrophe".

    (That said, I'm noticing that the visual appearance of U+02BC
    and U+2019 under several of the systems I use does tend to be
    different, although I suspect that this is due much more to
    accidents of implementation than to deliberate design.)

    > Thus, the difference would be _purely_ semantic. Would people
    > really want to make such distinctions in writing?

    Probably not in casual writing, but certainly in precise data
    encoding. Wouldn't it be nice, for example, if you could
    mechanically replace all the single quotes in a document with
    double, or check for proper open/close quote pairing, without
    having to worry about the apostrophes?
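
    As a rough sketch of what that would buy, assume a document in
    which U+2018/U+2019 are used only as quotation marks and
    apostrophes are encoded as U+02BC. Python and the function
    names below are assumptions for illustration only.

```python
def single_to_double(text):
    """Replace curly single quotes with curly double quotes."""
    return text.translate({0x2018: 0x201C, 0x2019: 0x201D})

def quotes_balanced(text):
    """Check that every U+2018 is closed by a later U+2019."""
    depth = 0
    for ch in text:
        if ch == "\u2018":
            depth += 1
        elif ch == "\u2019":
            depth -= 1
            if depth < 0:          # a close with no matching open
                return False
    return depth == 0

# Apostrophe as U+02BC: both operations are purely mechanical.
distinct = "\u2018We\u02BCve been here before,\u2019 she said."
print(quotes_balanced(distinct))    # True
print(single_to_double(distinct))   # only the two quotes change

# Apostrophe as U+2019: the pairing check fails, and the conversion
# would turn the apostrophe into a closing double quote.
shared = "\u2018We\u2019ve been here before,\u2019 she said."
print(quotes_balanced(shared))      # False
```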

    But certainly, a big part of this issue is that, practically,
    people don't have a good way of making such distinctions in
    writing. There's still only one key on most keyboards for both
    apostrophe and single quote, and the "smart quotes" feature of
    e.g. Microsoft Word is usually able to, in effect, turn that
    one key into two, but not into three. So if U+02BC is listed
    as the "preferred character for apostrophe", people who have no
    convenient way of entering it have to feel guilty that they're
    doing something wrong. (In fact, in cynical moments, I have
    almost concluded that the main reason for the change in Unicode's
    recommendations over time about U+02BC and U+2019 was just to
    reduce this guilt among users of Microsoft Word.)

    Today, however, if I want to reduce ambiguity by reserving
    U+2018/U+2019 for quotes, and using U+0027 or U+02BC for
    apostrophes, I get beat up for it: people point out Unicode's
    recommendation that U+2019 is preferred, and now *I* have to
    feel guilty.

    > Similarly, the use of the full stop character "." as a sentence
    > termination (period) is semantically quite distinct from its use in
    > abbreviations (as in "Mr."), and its use as a decimal separator (in
    > English) or as a thousands separator (in many other languages) are
    > semantically distinct, too.

    Funny you should mention that -- just yesterday I was realizing
    that having distinct code points for "full stop" versus "decimal
    point", and "comma" versus "thousands separator", would be quite
    useful, especially when doing on-the-fly conversion of text to
    properly locale-representative forms.
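
    To make the difficulty concrete: with only U+002E and U+002C
    available, an on-the-fly converter has to guess which full stops
    are decimal points. The sketch below is a deliberately naive
    guess (Python, the regular expression, and the function name are
    all assumptions for illustration), and it shows where the guess
    goes wrong.

```python
import re

# Guess: any "." or "," sitting between digits belongs to a number.
NUMBER = re.compile(r"\d[\d.,]*\d|\d")
SWAP = str.maketrans({".": ",", ",": "."})

def swap_separators(text):
    """Swap '.' and ',' inside digit runs, leaving other text alone."""
    return NUMBER.sub(lambda m: m.group().translate(SWAP), text)

# The sentence-ending full stop is left alone, as intended.
print(swap_separators("Total: 1,234.56."))

# But a section number is not a decimal fraction, and it is
# rewritten anyway.
print(swap_separators("See section 3.2 of 2006."))
```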

    > Making distinctions on purely semantic grounds, for a character
    > that is commonly understood as one character with multiple uses,
    > would apparently have opened a can of worms.

    "Would have"? Remember that Unicode has done exactly that in
    several other places as well! We've split off U+2010 Hyphen,
    U+2013 En Dash, and U+2212 Minus Sign from the old, ambiguous
    ASCII U+002D Hyphen-Minus. We've split off U+2044 Fraction
    Slash and U+2215 Division Slash from U+002F Solidus. (Granted,
    I'm not sure anyone makes use of all these disambiguations,
    though of course in typography the true hyphen is distinct.)
    We've got U+212B Angstrom Sign distinct from U+00C5 Latin Capital
    Letter A with Ring Above, and several other glyph-identical
    characters in the 21xx Letterlike Symbols block. We've got
    U+00B5 Micro Sign distinct from U+03BC Greek Small Letter Mu,
    although of course that one was forced on us by ISO 8859-1.
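
    One way to see how firmly each of these distinctions is held
    is to check whether it survives Unicode normalization. This is
    a side illustration rather than part of the argument above; it
    is a minimal sketch assuming Python's standard unicodedata
    module, and survives() is a hypothetical helper name.

```python
import unicodedata

def survives(ch, form):
    """True if normalizing ch under form leaves it unchanged."""
    return unicodedata.normalize(form, ch) == ch

# Angstrom Sign is canonically equivalent to U+00C5, so even NFC
# folds the two together.
print(survives("\u212B", "NFC"))      # False

# Micro Sign survives NFC but is folded into Greek mu under NFKC.
print(survives("\u00B5", "NFC"))      # True
print(survives("\u00B5", "NFKC"))     # False

# Minus Sign and Modifier Letter Apostrophe have no decomposition,
# so they stay distinct under every normalization form.
print(survives("\u2212", "NFKC"))     # True
print(survives("\u02BC", "NFKC"))     # True
```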

    I'm not sure why the can of worms is so much squirmier for
    apostrophes than for the other characters. I was *hoping* that
    what would change, over time and with the help of Unicode's
    new distinctions, was that people's "common perception of one
    character with multiple uses" would be reduced, that people would
    start to recognize the distinctions. Unfortunately, in the case
    of apostrophes, we've slid backwards, and Unicode has changed to
    reflect the common perception that apostrophes and close single
    quotes are still the same.
