RE: Apostrophes at www.unicode.org

From: Philippe Verdy (verdy_p@wanadoo.fr)
Date: Wed Aug 22 2007 - 08:11:23 CDT

Next message: Peter Constable: "RE: New Corrigendum to The Unicode Standard"

Previous message: Philippe Verdy: "feedback on UAX #29 : word breaks with hiragana and voiced marks"
In reply to: Jukka K. Korpela: "Apostrophes at www.unicode.org"
Next in thread: Asmus Freytag: "Re: Apostrophes at www.unicode.org"

> I noticed that the page http://www.unicode.org/standard/WhatIsUnicode.html
> contains the APOSTROPHE U+0027 in many of the names of the translations
> (e.g. Albanian, French, Italian, Maltese) as well as elsewhere. I did not
> find any obvious feedback address and, moreover, this seems to be a matter
> of principle rather than just a technical fix.

It's true, and this applies to the UDHR translations too. Note that the
English UDHR text does not use any apostrophe or single quote, so it is
correct, even though such a text is unusual: ordinary English contains lots
of apostrophes.

> Since the Unicode Standard says that RIGHT SINGLE QUOTATION MARK U+2019 is
> the preferred character for a punctuation apostrophe, shouldn't the
> Unicode Consortium's web pages use that character? Especially on pages
> that demonstrate the power of Unicode in presenting texts in different
> languages correctly.

But note that automatically replacing the dactylographic (vertical)
apostrophe with the typographic apostrophe (U+2019) is not the right choice
for all languages. In Hawaiian and Tahitian, for example, it should in most
cases be the other apostrophe-like character, U+2018 (which has been
accepted for inclusion in UAX#29 in the list of "MidLetters" for word
breaks). In those Austronesian languages this character is in fact a letter
noting the glottal stop. But not always: they also use the apostrophe to
note suppressed letters, so there is a real semantic difference.

But Unicode also has an apostrophe letter (U+02BC MODIFIER LETTER
APOSTROPHE), distinct from the right single quotation mark. I just wonder
for which language it is intended. Maybe the name is not suggestive enough:
I suspect that this apostrophe-like letter is not a true apostrophe but,
as in the Austronesian languages, a letter too.
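
To make the distinction concrete, here is a small sketch in Python that asks
the Unicode Character Database for the name and general category of each
candidate. The list of code points is my own selection for illustration, and
it assumes that the "apostrophe letter" meant above is U+02BC:

```python
import unicodedata

# Code points discussed in this thread (an illustrative selection, with
# U+02BC assumed to be the "apostrophe letter" mentioned above).
CANDIDATES = ["\u0027", "\u2018", "\u2019", "\u02BC"]

def describe(ch):
    """Return (code point label, Unicode name, general category) for ch."""
    return (f"U+{ord(ch):04X}", unicodedata.name(ch), unicodedata.category(ch))

for ch in CANDIDATES:
    print(*describe(ch))
```

The general category is the interesting part: the quotation marks are
punctuation (Po, Pi, Pf) while U+02BC is a modifier letter (Lm), which is
what matters for word breaking and identifier rules.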

> The question extends to the translations themselves, which use U+0027. On
> monolingual pages in a Latin script, the use of U+0027 as a
> typographically wrong but "safe" replacement for U+2019 might be
> defendable, though support to U+2019 is fairly universal now (in web
> browsers and in fonts).

This support for U+2018/U+2019 in fonts is quite common simply because they
have long been part of many legacy charsets, in addition to the ASCII
vertical quote (at least in all Windows charsets). Your concern should be
extended to the U+201C/U+201D quotation mark pair for the same reason (and
also to U+201E, the double low-9 quotation mark, which was also added to
those Windows codepages for use in languages such as German).
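
As a quick check of the legacy-charset point, the following Python sketch
looks up the byte assigned to each of these quotation marks in windows-1252
(Python's "cp1252" codec). Only that one codepage is tested here; the helper
name is hypothetical:

```python
# Curly quotation marks mentioned above, with their Unicode names.
QUOTES = {
    "\u2018": "LEFT SINGLE QUOTATION MARK",
    "\u2019": "RIGHT SINGLE QUOTATION MARK",
    "\u201C": "LEFT DOUBLE QUOTATION MARK",
    "\u201D": "RIGHT DOUBLE QUOTATION MARK",
    "\u201E": "DOUBLE LOW-9 QUOTATION MARK",
}

def cp1252_byte(ch):
    """Return the single byte that encodes ch in windows-1252."""
    return ch.encode("cp1252")[0]

for ch, name in QUOTES.items():
    print(f"U+{ord(ch):04X} -> 0x{cp1252_byte(ch):02X}  {name}")
```

All five fall in the 0x80-0x9F range, which windows-1252 uses for printable
characters rather than C1 controls.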

> Note: The pages seem to use the character reference &#39; instead of
> U+0027 itself. I have seen many people assume that this makes a difference
> and that &#39; is the proper character for a punctuation character. Yet in
> reality it means U+0027 and nothing else.

There's no difference, except syntactically. When the character reference is
used, it's most often to avoid a syntactic problem where the ASCII quote
plays another role:
* in the MediaWiki syntax, the ASCII quote is used to mark italic and bold
styles, using 2 or 3 quotes, and when you need to place an apostrophe at the
boundary of a bold or italic word, the quotes are counted to determine what
to do. The algorithm is now smarter and tries to match the leading and
trailing italic/bold sequences in pairs, so the remaining quote is left
unchanged and &#39; is not needed, but there are still cases where the
reference is needed, for example if the character is the first one of a
template parameter which will be italicized by this template.
* in scripting languages like PHP and in other programming languages, the
single quote terminates a string syntactically, and using a reference
is one alternative. Other alternatives used in those languages are the
notations \047 (octal, but in the local compiler charset and not
Unicode), \x27 (local compiler charset too), \u0027 (explicitly
Unicode, but limited to UTF-16 code units: in Java, for example, the \u
notation has to be used once for each surrogate of a character outside the
BMP) or \U00000027 (UTF-32, no surrogates needed).
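
The escape notations can be compared directly in Python, used here only as
a convenient illustration. One caveat: in a Python str the octal and \x
escapes denote code points, not bytes in a compiler charset, so the charset
remark above applies to C-like languages rather than to this sketch. The
helper utf16_units is a hypothetical name:

```python
# Four escapes for the same character, U+0027 APOSTROPHE.
octal = "\047"        # octal 47 == decimal 39 == hex 27
hexa = "\x27"
bmp = "\u0027"
full = "\U00000027"
same = octal == hexa == bmp == full == "'"

def utf16_units(text):
    """Return the UTF-16 code units of text as a list of integers."""
    data = text.encode("utf-16-be")
    return [int.from_bytes(data[i:i + 2], "big") for i in range(0, len(data), 2)]

print(same)
print([hex(u) for u in utf16_units("'")])           # a single code unit
print([hex(u) for u in utf16_units("\U0001D11E")])  # a surrogate pair
```

U+1D11E (MUSICAL SYMBOL G CLEF) lies outside the BMP, so a language whose \u
escape is limited to 16 bits has to write it as two escapes, one for each
surrogate, whereas \U0001D11E names it in one step.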

So those people think that &#39; is safer only because it does not cause
them syntactic problems. They end up thinking that this is the only correct
way to encode it, without realizing that it is the same character as
the simple dactylographic quote on their keyboard in contexts where that
works. Using &#39; or even a named reference is overkill in an HTML-only
document, but is still needed when the character is part of some JavaScript
fragment.
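
That the references are nothing more than another spelling of U+0027 can be
verified with Python's standard html module. This is only a sketch of the
equivalence; the list of spellings is my own selection:

```python
import html

# Decimal, hexadecimal and named references for the ASCII apostrophe.
forms = ["&#39;", "&#x27;", "&apos;"]
decoded = [html.unescape(s) for s in forms]

# Every spelling decodes to the very same one-character string.
print([f"U+{ord(d):04X}" for d in decoded])

# Going the other way, html.escape() emits a reference only so that the
# quote is safe inside attribute values, not because the character differs.
print(html.escape("it's"))
```

None of these produces U+2019; getting the typographic apostrophe requires
writing that character (or a reference to it, such as &#8217;) explicitly.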

This archive was generated by hypermail 2.1.5 : Wed Aug 22 2007 - 08:15:27 CDT