Re: Another take on the English apostrophe in Unicode

From: Marcel Schneider
Date: Tue, 16 Jun 2015 19:08:05 +0200 (CEST)

On Sat, Jun 13, 2015, Mark Davis wrote:

> In particular, I see no need to change our recommendation on the character used
> in contractions for English and many other languages (U+2019). Similarly, we wouldn't
> recommend use of anything but the colon for marking abbreviations in Swedish, or
> propose a new MODIFIER LETTER ELLIPSIS for "supercali...docious".

> (IMO, U+02BC was probably just a mistake; the minor benefit is not worth the confusion.)

On Mon, Jun 15, 2015 at 10:19 AM, Mark Davis ☕️ wrote:

> On Mon, Jun 15, 2015 at 9:17 AM, Marcel Schneider wrote:

>> When we take the topic down again from linguistics to the core mission of Unicode, that is character encoding and text processing standardisation, ellipsis and Swedish abbreviation colon differ from the single closing quotation mark in this, that they are not to be processed.

>> [...]

> Quite nice of you to inform me of the core mission of Unicode—I must have somehow missed that.

I was rather astonished and amused to read that I might have been aiming to inform you of Unicodeʼs core mission. My goal was only to check that I was arguing at the right level. There would have been another way to say it, but it did not come to mind.

However, what surprises me even more, as I think about it, is that while knowing everything about Unicode, you hold only a weak opinion on which apostrophe recommendation is the right one...

> More seriously, it is not all so black and white. As we developed Unicode, we considered whether to separate characters by function, eg, an END OF SENTENCE PERIOD, ABBREVIATION PERIOD, DECIMAL PERIOD, NUMERIC GROUPING PERIOD, etc. Or DIARASIS vs UMLAUT. We quickly concluded that the costs far, far outweighed the benefits.

Itʼs another proof of Unicodeʼs professionalism that distinguishing DIAERESIS and UMLAUT was even considered. Despite being bilingual in French and German and knowing the diacritics, I first encountered that distinction in Microsoftʼs kbd.h, where the one is called DIARESIS and is mapped to UMLAUT. Iʼm not a friend of such distinctions (except in vocabulary and grammar), because in writing practice they would be nothing but useless and counterproductive complications. An abbreviation dot would have been much more useful, but to deliver its benefits it would have needed an additional key mapping. Against this background, Unicodeʼs choice to recommend disambiguating the apostrophe is all the more meritorious. I see it as proof that there is a really good reason for people to mind the difference whenever they donʼt use the ASCII apostrophe for all of them. What would have bothered Microsoft then was that it would have had to implement this difference in its word processing and desktop publishing software, and to tell users about it. Nothing easier for Microsoft with all the Help and Info! “The new smart quotes help you to check whether you need an apostrophe or a quote. This makes quotes conversion easy.” Or the like.

> In practice, whenever characters are essentially identical—and by that I mean that the overlap between the acceptable glyphs for each character is very high—people will inevitably mix up the characters on entry. So any processing that depends on that distinction is forced to correct the data anyway. And separating them causes even simple things like searching for a character on a page to get screwed up without having equivalence classes.

Based on the Unicode principle of encoding characters, not glyphs, I doubt that two characters may be called _essentially_ identical just because they look the same. A huge subset of the Code Chartsʼ cross-references exists to help font designers on this point. As for people mixing characters up, they are most likely to do so when the keyboard offers only one of the two. This is not the case for U+02BC and U+2019, neither of which is on standard keyboards. Here itʼs the smart quotes algorithm that will mix them up! And this one is easily helped not to do so, since itʼs embedded in high-end software with all its display and shortcut capabilities. In the end, the only one who wanted to keep mixing them up was (guess who?) Microsoft.

The reason? Word processing that depends on the distinction between opening and closing quotation marks, which needs only a very tiny algorithm, is much easier to implement than processing that depends on the distinction between the apostrophe and the single closing quotation mark, and between the apostrophe and single quotation marks in general. Informal English word forms are so rich and varied that some are ambiguous, and scarcely any software dictionary can contain them all. But even formal English is not fully supported, since nested quotes often are not. Why would users not be interested in improved software, even if it cost a little more?
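
To illustrate how tiny such an opening/closing algorithm can be, here is a minimal sketch in Python. It is an assumption of mine about the general shape of such a heuristic, not a description of any vendorʼs implementation; the names OPENERS and smart_single_quotes are hypothetical.

```python
# Hypothetical sketch of a context-only "smart quotes" rule for the
# straight single quote U+0027. Not taken from any real product.

# Characters after which a quote is assumed to open (an assumption).
OPENERS = set(" \t\n([{\u2018\u201C")

def smart_single_quotes(text: str) -> str:
    """Replace each U+0027 by U+2018 or U+2019, looking only at the
    character immediately before it."""
    out = []
    for i, ch in enumerate(text):
        if ch != "'":
            out.append(ch)
        elif i == 0 or text[i - 1] in OPENERS:
            out.append("\u2018")  # treated as an opening quote
        else:
            out.append("\u2019")  # closing quote or apostrophe, undistinguished
    return "".join(out)
```

A rule of this kind emits U+2019 for the apostrophe in "don't" and for a closing quotation mark alike, and it turns the leading apostrophe of "'tis" into an opening quote U+2018. Telling these cases apart would need lexical knowledge, which is the harder processing discussed above.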

About searching and equivalence classes: There is already plenty of equivalence implemented in the simplest search algorithm: case-insensitivity! One more class containing (U+0027, U+02BC, U+2019) wouldnʼt change much.
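
As a rough illustration, a search that already folds case can fold the apostrophe class in the same pass. The following Python sketch assumes the class is exactly the three code points named above; the names APOSTROPHE_CLASS, fold and contains are hypothetical, and real search engines may well proceed differently.

```python
# Hypothetical sketch: one extra equivalence class added to a
# case-insensitive substring search.

# The three code points treated as equivalent (an assumption).
APOSTROPHE_CLASS = {"\u0027", "\u02BC", "\u2019"}

def fold(text: str) -> str:
    """Case-fold the text, then map every member of the apostrophe
    class to U+0027."""
    folded = text.casefold()
    return "".join("\u0027" if ch in APOSTROPHE_CLASS else ch for ch in folded)

def contains(haystack: str, needle: str) -> bool:
    """True if needle occurs in haystack, ignoring case and the
    difference between the three apostrophe-like characters."""
    return fold(needle) in fold(haystack)
```

With this sketch, a query typed with U+02BC finds text stored with U+2019 or U+0027, just as a lowercase query finds uppercase text. The function only reports whether a match exists, because case folding can change string length and so shift match positions.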

> So we only separated essentially identical characters in limited cases: such as letters from different scripts.

I repeat myself: Calling like-looking glyphs “essentially identical characters” is inconsistent with Unicodeʼs principle of encoding characters, not glyphs. But anyway, I repeat myself again: Under these circumstances, Unicodeʼs recommendation to prefer U+02BC for the apostrophe carries all the more weight!

Best regards,
Marcel Schneider
Received on Tue Jun 16 2015 - 12:09:08 CDT