Re: Best 10646/Unicode chara

From: Mark Davis (mark_davis@taligent.com)
Date: Thu Jul 04 1996 - 17:13:43 EDT

Reply to: RE>Best 10646/Unicode character for apostrophes?

> Visual identity in _all_ contexts implies character identity (except when
> different scripts are involved)

In designing Unicode, we went to great pains to avoid characters with
identical glyphs, just because of the confusion that it can cause. But we have
to be a bit clearer what that entails.

The way I think of it is that for each Unicode character there is a range of
glyphs that can represent that character. But those glyphs can be better or
worse representations:
* Core.
Some of these glyphs are core, and would be recognized as reasonable
representations by essentially all people familiar with the character.
* Grey.
Then there is a grey area; glyphs that are only recognized as representations
by a certain community, or only recognized in context.
* Outcasts.
Finally, there are glyphs that would not under any reasonable circumstances
represent that character, such as a "B" glyph for the "x" character.

If both the core and grey areas are the same for any two purported characters,
then there is a great opportunity for confusion. Unicode only encodes these as
different characters to maintain historic source-standard compatibility. Where
the core glyphs are the same, but the grey areas differ, then it is a matter
of judgement whether one or two characters should be encoded: it depends on
the degree and circumstances of overlap.

There are a very few cases where there are no compatibility requirements, and
Unicode encodes redundant characters; I view those as mistakes, and agree with
you that U+02BC MODIFIER LETTER APOSTROPHE and U+2019 RIGHT SINGLE QUOTATION
MARK should be treated as compatibility equivalents; as should some of the
middle dots.

Some members have felt that it is worth making such distinctions. For example,
suppose you have the text "<left-quote>can<apostrophe>t<right-quote>". Since
the characters are distinguished, you can semantically parse this correctly to
know that the apostrophe is actually treated as part of the word.
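
A minimal sketch of that parsing argument, in Python. It assumes the standard
`unicodedata` module; the helper name `words` is hypothetical. Because U+02BC
has a letter general category (Lm) while U+2018 and U+2019 are punctuation, a
tokenizer that simply groups letters keeps the apostrophe inside the word:

```python
import unicodedata

LEFT_QUOTE = "\u2018"    # LEFT SINGLE QUOTATION MARK
RIGHT_QUOTE = "\u2019"   # RIGHT SINGLE QUOTATION MARK
APOSTROPHE = "\u02BC"    # MODIFIER LETTER APOSTROPHE


def words(text):
    """Hypothetical helper: split text into maximal runs of letters."""
    result, current = [], []
    for ch in text:
        # General categories Lu, Ll, Lt, Lm and Lo all start with "L".
        if unicodedata.category(ch).startswith("L"):
            current.append(ch)
        elif current:
            result.append("".join(current))
            current = []
    if current:
        result.append("".join(current))
    return result


sample = LEFT_QUOTE + "can" + APOSTROPHE + "t" + RIGHT_QUOTE
tokens = words(sample)   # one token, with the apostrophe inside it
```

If the apostrophe is entered as U+2019 instead, the same helper splits the
word in two, which is the failure the distinction is meant to avoid.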

In practice, as in other cases, I think that it is a vain hope that users will
always insert the correct one of two identical-looking characters! So your
parsing algorithm has to be smart enough to cope with wrong or swapped
characters anyway!
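
One way such a tolerant parser might look, as a rough Python sketch. The
function names and the rule itself are assumptions made for illustration:
treat U+0027, U+2019 and U+02BC as interchangeable on input, and call the
character an apostrophe only when it has a letter on both sides.

```python
import unicodedata

# Characters that users tend to enter interchangeably.
AMBIGUOUS = {"\u0027", "\u2019", "\u02BC"}


def is_letter(ch):
    """True for letters, not counting the ambiguous marks themselves."""
    return ch not in AMBIGUOUS and unicodedata.category(ch).startswith("L")


def classify_marks(text):
    """Hypothetical heuristic: label each ambiguous mark by its context."""
    labels = []
    for i, ch in enumerate(text):
        if ch not in AMBIGUOUS:
            continue
        before = i > 0 and is_letter(text[i - 1])
        after = i + 1 < len(text) and is_letter(text[i + 1])
        kind = "apostrophe" if before and after else "quote"
        labels.append((i, kind))
    return labels


# The word-internal mark was typed as U+2019 rather than U+02BC.
labels = classify_marks("\u2018can\u2019t\u2019")
```

This rule is deliberately crude: a word-final apostrophe, as in a plural
possessive, is indistinguishable from a closing quote by context alone.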

As to the second point, you are correct in that the informative statements in
the book should really be restated as something like:
: preferred character for opening single quotation mark in English
  is 2018
: preferred character for closing single quotation mark in English
  is 2019
: many languages have different conventions for quoted text.

Unfortunately, it is too late to get into the next edition of the book
(camera-ready copy was just sent to the printers!), but we can bring both of
these issues up for discussion at the next Unicode meeting, aiming for the
next edition, and our planned web updates.

Mark

(By the way, the Unicode characterization of 0027 is not a "theory"!)

--------------------------------------
Date: 7/3/96 1:02 PM
To: Mark Davis
From: Olle Jarnefors
(I crosspost this message to the ISO10646 and Unicode
lists, in the hope of provoking comments from
extra-European experts.)

On the tc304wg2 list Michael Everson wrote, quoting me:

> >(Note that in MES, U+2019 (RIGHT SINGLE QUOTATION MARK),
> >not U+0027 (APOSTROPHE), should be used as single quote.
>
> "Should"? I use U0027 in my Web pages.

You're right and I was wrong. The correct coding of
apostrophes and single quotation marks in 10646/Unicode
isn't as simple as I thought when I wrote that. There
seem to be too many possible characters to use in
10646/Unicode. Most WWW readers support too few, and
U+0027 is the least bad to use on the web at present.

In ISO 10646, the character names indeed suggest that
U+0027 should be used for apostrophes and U+2019 as
single quote character. But ISO character names are
occasionally misleading, and there is a very good reason
for using the _same_ 10646/Unicode character for both
apostrophe and closing single quotation mark: not even
with a microscope can one tell apostrophes from such
quote characters in print, and visual identity in _all_
contexts implies character identity (except when
different scripts are involved).

(Many languages, including English and Swedish, use this
character for closing a single quote, but there may be
languages using some other quotation mark in this
situation.)

The Unicode book (version 1.0) has another theory:

: 0027 ' APOSTROPHE-QUOTE
: = ISO APOSTROPHE
: neutral (vertical) glyph having mixed usage
: preferred character for apostrophe is 02BC
: preferred character for opening single quotation mark is 2018
: preferred character for closing single quotation mark is 2019

(Here quoting mark conventions identical to those of the
English language are tacitly presupposed.)

The Unicode book is correct in the observation that
U+0027 is not a good representation of either apostrophe
or closing single quotation mark. It is merely the
"ambiguous" or "neutral" character, invented with
type-writers, that can be used for both apostrophes and
single quotes, when better characters are not available
(like on type-writers and in restricted character sets
like ASCII, ISO 6937, ISO 8859).

I don't agree with the Unicoders that different
characters should be used for apostrophe and the closing
single quotation mark (U+02BC MODIFIER LETTER APOSTROPHE
and U+2019 RIGHT SINGLE QUOTATION MARK). They are visually
identical, so very few people who enter text (and no
OCR programs) can be trusted to consistently choose the
correct character. The distinction between these
characters is useless in practice, and one of them should
be classified as a compatibility character; I would
prefer U+02BC to be so classified.

Unfortunately, the Unicode character database
<ftp://unicode.org/pub/MappingTables/UnicodeData-2.0.12.txt>
which I suppose is the final version, does not include
any hint of the very intimate relationship between U+02BC
and U+2019.
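
The same check can be made programmatically. The sketch below is in Python
and assumes the Unicode data bundled with the interpreter's `unicodedata`
module, which is a later version than the UnicodeData-2.0.12.txt file cited
above; the helper name `describe` is hypothetical. It looks up the
decomposition field, where a compatibility mapping would be recorded:

```python
import unicodedata

MODIFIER_APOSTROPHE = "\u02BC"   # MODIFIER LETTER APOSTROPHE
RIGHT_QUOTE = "\u2019"           # RIGHT SINGLE QUOTATION MARK


def describe(ch):
    """Hypothetical helper: collect the database fields of interest."""
    return {
        "code": "U+%04X" % ord(ch),
        "name": unicodedata.name(ch),
        "category": unicodedata.category(ch),
        # Empty string when the database records no decomposition mapping.
        "decomposition": unicodedata.decomposition(ch),
    }


apostrophe_info = describe(MODIFIER_APOSTROPHE)
quote_info = describe(RIGHT_QUOTE)

# Compatibility normalization does not fold one character into the other.
still_distinct = (unicodedata.normalize("NFKC", MODIFIER_APOSTROPHE)
                  != unicodedata.normalize("NFKC", RIGHT_QUOTE))
```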

/Olle

------------------ RFC822 Header Follows ------------------
Received: by taligent.com with SMTP;3 Jul 1996 13:00:18 -0800
Received: from taligent.com by mailserv.taligent.com (AIX 3.2/UCB 5.64/4.03)
          id AA24487; Wed, 3 Jul 1996 13:00:15 -0700
Received: from unicode.org by taligent.com with SMTP (5.67/23-Oct-1991-eef)
        id AA24444; Wed, 3 Jul 96 12:58:33 -0700
        for
Received: by Unicode.ORG (NX5.67c/NX3.0M)
        id AA11843; Wed, 3 Jul 96 12:52:29 -0700
Date: Wed, 3 Jul 96 12:52:29 -0700
From: unicode@Unicode.ORG
Message-Id: <9607031952.AA11843@Unicode.ORG>
Reply-To: Olle Jarnefors <ojarnef@admin.kth.se>
Errors-To: uni-bounce@Unicode.ORG
Subject: Best 10646/Unicode character for apostrophes?
To: unicode@Unicode.ORG

This archive was generated by hypermail 2.1.2 : Tue Jul 10 2001 - 17:20:31 EDT