RE: Apostrophes at www.unicode.org

From: Philippe Verdy (verdy_p@wanadoo.fr)
Date: Wed Aug 22 2007 - 08:11:23 CDT


    > I noticed that the page http://www.unicode.org/standard/WhatIsUnicode.html
    > contains the APOSTROPHE U+0027 in many of the names of the translations
    > (e.g. Albanian, French, Italian, Maltese) as well as elsewhere. I did not
    > find any obvious feedback address and, moreover, this seems to be a matter
    > of principle rather than just a technical fix.

    It's true, and this includes the UDHR translations too. Note that the
    English UDHR text does not use any apostrophe or single quote, so it is
    correct, even though such text is unusual: ordinary English contains lots
    of apostrophes.

    > Since the Unicode Standard says that RIGHT SINGLE QUOTATION MARK U+2019 is
    > the preferred character for a punctuation apostrophe, shouldn't the
    > Unicode Consortium's web pages use that character? Especially on pages
    > that demonstrate the power of Unicode in presenting texts in different
    > languages correctly.

    But note that automatically replacing the dactylographic (vertical)
    apostrophe with the typographic apostrophe (U+2019) is not the right choice
    for all languages. In Hawaiian and Tahitian, for example, most occurrences
    should instead be the other apostrophe-like character, U+2018 (which has
    been accepted for inclusion in UAX#29 in the list of "MidLetters" for word
    breaks). In those Austronesian languages it is in fact a letter noting the
    glottal stop. But not every occurrence should be converted, because these
    languages also use the true apostrophe to note suppressed letters, so there
    is a real semantic difference.

    But Unicode also has an apostrophe letter, distinct from the right single
    quotation mark. I just wonder which language it is intended for (and maybe
    the name is not suggestive enough, as I suspect that this apostrophe-like
    letter is not a true apostrophe but, as in the Austronesian languages, a
    letter too).
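
To make the distinction between these look-alike characters concrete, here is a
minimal Python sketch that prints the name and general category of each one. It
assumes that the "apostrophe letter" mentioned above is U+02BC; the message does
not give its code point, so treat that entry as a guess. The helper name
`describe` is purely illustrative.

```python
import unicodedata

# Apostrophe-like characters discussed in this thread. U+02BC is an assumption:
# the message does not say which code point "apostrophe letter" refers to.
CANDIDATES = ["\u0027", "\u2018", "\u2019", "\u02BC"]


def describe(ch):
    """Return (code point label, general category, character name)."""
    return (f"U+{ord(ch):04X}", unicodedata.category(ch), unicodedata.name(ch))


for ch in CANDIDATES:
    print(*describe(ch))
```

The general category is what separates punctuation (categories starting with
"P") from letters (categories starting with "L"), which is the semantic
difference at stake here.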

    > The question extends to the translations themselves, which use U+0027. On
    > monolingual pages in a Latin script, the use of U+0027 as a
    > typographically wrong but "safe" replacement for U+2019 might be
    > defendable, though support to U+2019 is fairly universal now (in web
    > browsers and in fonts).

    This support of U+2018/U+2019 in fonts is quite common simply because they
    have long been part of many legacy charsets, in addition to the ASCII
    vertical quote (at least in all Windows charsets). Your concern should be
    extended to the U+201C/U+201D quotation mark pair for the same reason (and
    also to U+201E, the double low-9 quotation mark, which was also added to
    those Windows code pages for use in Spanish).
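
As a small illustration of the legacy-charset point, the sketch below uses
Python's built-in `cp1252` codec (Windows Western European) to show that each
of these quotation marks has a single-byte encoding there, while plain ASCII
has none. Only one Windows code page is checked; the helper names are
illustrative, not taken from any standard API.

```python
# Typographic quotation marks mentioned in the discussion.
QUOTES = ["\u2018", "\u2019", "\u201C", "\u201D", "\u201E"]


def cp1252_byte(ch):
    """Return the single byte that encodes `ch` in Windows-1252."""
    encoded = ch.encode("cp1252")
    assert len(encoded) == 1  # cp1252 is a single-byte charset
    return encoded[0]


def in_ascii(ch):
    """Return True if `ch` can be encoded in 7-bit ASCII."""
    try:
        ch.encode("ascii")
        return True
    except UnicodeEncodeError:
        return False


for ch in QUOTES:
    print(f"U+{ord(ch):04X} -> cp1252 0x{cp1252_byte(ch):02X}, ascii: {in_ascii(ch)}")
```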

    > Note: The pages seem to use the character reference &#39; instead of
    > U+0027 itself. I have seen many people assume that this makes a difference
    > and that &#39; is the proper character for a punctuation character. Yet in
    > reality it means U+0027 and nothing else.

    There's no difference, except syntactically. When the character reference is
    used, it's most often to avoid a syntactic problem where the ASCII quote
    plays another role:
    * in the MediaWiki syntax, the ASCII quote is used to mark bold and italic
    styles, using 2 or 3 quotes, and when you need to place an apostrophe at the
    boundary of a bold or italic word, the quotes are counted to determine what
    to do. The algorithm is now smarter and tries to match the leading and
    trailing italic/bold sequences in pairs, so the remaining quote is left
    unchanged and &#39; is not needed, but there are still cases where the
    reference is needed, for example if the character is the first one of a
    template parameter that the template will italicize.
    * in scripting languages like PHP and in other programming languages, the
    single quote terminates a string syntactically, and using a reference
    is an alternative. Other alternatives used in those languages are the
    notations \047 (octal, but using the local compiler charset and not
    Unicode), or \x27 (local compiler charset too), or \u0027 (explicitly
    Unicode, but UTF-16 only: in Java, for example, the \u notation needs to
    be used for each surrogate of a single character outside the BMP), or
    \U00000027 (UTF-32, no surrogate needed).
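
The escape notations in the last item can be tried out in Python, which accepts
the octal, \x, \u and \U forms in string literals. This is only a sketch:
unlike the compiled languages described above, Python treats all four as
Unicode code points, so the "local compiler charset" caveat does not show up
here. The helper `utf16_units` is a hypothetical name; it lists the \u escapes
a UTF-16-based language such as Java would need for a given character.

```python
# Four spellings of the same one-character string (U+0027) in Python.
FORMS = ["\047", "\x27", "\u0027", "\U00000027"]
ALL_SAME = all(form == "'" for form in FORMS)


def utf16_units(ch):
    """Return the \\uXXXX escapes for the UTF-16 code units of `ch`."""
    data = ch.encode("utf-16-be")  # big-endian, no byte order mark
    units = [int.from_bytes(data[i:i + 2], "big") for i in range(0, len(data), 2)]
    return [f"\\u{unit:04X}" for unit in units]


print(ALL_SAME)
print(utf16_units("'"))           # one code unit: a BMP character
print(utf16_units("\U0001D11E"))  # two code units: a surrogate pair
```

U+1D11E (MUSICAL SYMBOL G CLEF) is used only as a convenient example of a
character outside the BMP.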

    So those people think that &#39; is safer only because it does not cause
    them syntactic problems. They end up thinking that this is the only correct
    way to encode it, without realizing that it is the same character as
    the simple dactylographic quote on their keyboard, wherever that works. Using
    &#39; or even a named reference is overkill in an HTML-only document, but is
    still needed when the character is part of some JavaScript fragment.
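
That the references and the literal character are one and the same can be
checked with Python's standard `html` module. This is a minimal sketch, not a
statement about how any browser is implemented: it decodes the decimal,
hexadecimal and named references and compares each result with the plain ASCII
quote.

```python
import html

# Decimal, hexadecimal and named references for the same character.
REFERENCES = ["&#39;", "&#x27;", "&apos;"]

# Each one decodes to U+0027 APOSTROPHE, never to U+2019.
DECODED = [html.unescape(ref) for ref in REFERENCES]

# Going the other way, html.escape() writes the quote as a reference so that
# the text can sit safely inside a quoted attribute value.
ESCAPED = html.escape("l'apostrophe")

for ref, ch in zip(REFERENCES, DECODED):
    print(f"{ref} -> U+{ord(ch):04X}")
print(ESCAPED)
```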


