Re: Naming of functional ASCII characters in Unicode

From: Kenneth Whistler (kenw@sybase.com)
Date: Mon Jun 05 2000 - 13:46:49 EDT


Bernd Warken suggested:

> The Unicode ASCII range U+00-7F still shows elements of the out-dated
> glyph approach instead of the intended character abstraction. This mail
> tries to point out some places where Unicode uses a text-oriented
> naming, tho a functionally oriented naming would be more suitable.

I'm sure you'll get other replies on this list. But the fundamental
issue you are up against is that the names of ISO/IEC 10646 (= Unicode)
characters are normative and fixed in concrete now. Neither of the
pertinent maintaining committees (JTC1/SC2/WG2 and the UTC) will
entertain *any* name changes. The names themselves are used in various
products which depend on them being standard and non-changing.
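
To illustrate that dependence, here is a minimal sketch, assuming
Python 3 and its standard unicodedata module. Lookups by name work only
because the names stay fixed. The helper name code_points is made up
for this example.

```python
import unicodedata

# Normative names of some of the ASCII characters discussed in this thread.
NAMES = [
    "QUOTATION MARK",
    "APOSTROPHE",
    "GRAVE ACCENT",
    "CIRCUMFLEX ACCENT",
    "HYPHEN-MINUS",
]

def code_points(names):
    """Map each character name to its code point in U+XXXX notation."""
    return {n: f"U+{ord(unicodedata.lookup(n)):04X}" for n in names}

for name, cp in code_points(NAMES).items():
    print(cp, name)

# Named escapes in source code rely on the same fixed names.
assert "\N{QUOTATION MARK}" == '"'
```

Any program that stores or compares these names would break if a name
were ever changed.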

> Historically, the 7-bits ASCII characters were used for databases and
> programming languages. In later years, text processing required better
> representations for some of these functional characters. This led to
> extensions like the well-known code-pages, ISO character standards, and
> Unicode.

ASCII characters have long been used for far more than just database
stores and programming languages. They have long been used in email
and text, and underlie nearly all of the extant character encodings
(with the significant exception of the EBCDIC code pages).

That said, the Unicode Technical Committee is fully aware of the
problems with the ASCII characters' ambiguous names and overloaded
functionality. Much of this is documented, in fact, in the Unicode
Standard.

>
> So the primary task of the ASCII-7 code is programming, not text
> processing. This makes the ASCII characters primarily functional.
> Unicode usually honors this fact by providing alternative characters in
> order to get more beautiful, printable look-alikes.
>
> Unfortunately, some names (and glyphs) do not reflect this functional
> meaning.

The names are normatively derived from ISO 646 (and ASCII). They will
not be changing. The Unicode Standard provides numerous aliases, however,
which indicate something of the ambiguous usage of these characters.

The glyphs printed in the Unicode Standard are well within the standard
range of display for ASCII characters.

>
> This might not seem a big problem today, but there are some long-term
> considerations in some interpreter languages to include wide characters
> in writing program code. At this point, the difference between
> functional characters and printable representations will become crucial.

I disagree. There are ambiguities in the usage of quotation marks
in ASCII, granted. But that is a well-known, longstanding problem that
is already addressed in ASCII interpreters. Making these wide characters
and supporting Unicode doesn't fundamentally change anything here.

>
> " U+0022 QUOTATION MARK
>
...
>
> Renaming it to DOUBLE QUOTE, would uniquely characterize the character,
> mark it as a functional character, and relate it to the other ASCII
> quotes. So it really could be renamed to DOUBLE QUOTE, maybe with the
> additional name NEUTRAL QUOTATION NAME.

Renaming it changes nothing about its usage. And it isn't going to be
renamed. That same name is used in ASCII, ISO 646, all of the ISO 8859
parts, and in many other standards.

> ' U+0027 APOSTROPHE
>
> The Unicode documentation for this character says `preferred character
> for apostrophe is 2019'. So the main name APOSTROPHE is even documented
> as being wrong.
>
> It should be renamed according to its function, i.e., SINGLE QUOTE or
> RIGHT QUOTE. These names were used for decades, before Unicode changed
> it. The 3rd name `APL quote' looks like a possible candidate, but I
> never found this name in a software documentation.
>
> Again APOSTROPHE is text-oriented, while * QUOTE is functional.

I don't think this is a valid distinction.

>
> A second problem with this character is its construction as `a neutral
> (vertical) glyph having mixed usage'. ASCII-7 based terminals (the
> basis programming environment) usually display the single quote as a
> raised 9 quote or with a right slope, making it into a `right quote' and
> not a `mixed vertical quote'. Moreover, its inclination helps to
> distinguish it from its antipode ` (U+0060).

Here is where you are introducing a problem. In programming languages,
single quotes are indicated by pairs of U+0027, just as double quotes
are indicated by pairs of U+0022. They do *not* pair U+0060 with
U+0027 to form a left and right quote. That usage is a widespread
email convention -- not a formal language syntax convention. And it
has caused no end of problems, in Unix systems in particular. There
was a long thread on this problem about a year ago on this list, regarding
how the glyphs for new Unix fonts should be resolved for U+0060 and
U+0027. I believe the issue was resolved already. Perhaps others on
the list would care to provide more details.
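
The pairing rule can be checked against one language. This is a small
sketch, assuming Python 3 as the example language, and the helper name
parses_as_string is hypothetical. A literal delimited by a pair of
U+0027 or a pair of U+0022 parses; one opened with U+0060 and closed
with U+0027 does not.

```python
import ast

def parses_as_string(src):
    """Return True if src is a valid Python string literal."""
    try:
        return isinstance(ast.literal_eval(src), str)
    except (SyntaxError, ValueError):
        return False

print(parses_as_string("'quoted'"))   # pair of U+0027
print(parses_as_string('"quoted"'))   # pair of U+0022
print(parses_as_string("`quoted'"))   # U+0060 ... U+0027, the email convention
```

Other languages have their own rules for U+0060 (the POSIX shell uses
pairs of it for command substitution), but there too it is paired with
itself, not with U+0027.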

>
> Maybe some new character VERTICAL APOSTROPHE should be defined for the
> actual concept.
>
>
> ` U+0060 GRAVE ACCENT
>
> Again the naming indicates a text-oriented approach. Like before, this
> character is basically functional, e.g., in POSIX shell programming.
>
> The pointers in the Unicode documentation show that for each of the
> different textual usages an alternative look-alike character was
> specified. So only its functional meaning is left and should be
> reflected in its naming.
>
> Traditionally, this character was called `back-quote' or `left quote' to
> correlate it with ' (U+0027). So it should be renamed to BACK-QUOTE.
> Depending on how the glyph is constructed, its additional name should be
> LEFT QUOTE or STAND-ALONE GRAVE ACCENT.
>
>
> ^ U+005E CIRCUMFLEX ACCENT
>
> This character has always been known as the `caret' in POSIX wildcards
> and regular expressions. So it should be renamed to CARET, possibly
> with the additional name STAND-ALONE CIRCUMFLEX ACCENT. The IPA
> character U+028C LATIN SMALL LETTER TURNED V shouldn't interfere.

In ASCII, this character is ambiguous between these functions.

As with the other characters, its name is not going to be changed.

> - U+002D HYPHEN-MINUS
>
> The name HYPHEN-MINUS is not suitable, for there is already a printable
> hyphen, a printable minus sign, and several dashes.
>
> Again, this character is a functional character for programming
> languages (MINUS OPERATOR) and system administration (OPTION CHARACTER);
> it should reflect these names.

This kind of reasoning could lead to suggestions that U+003F QUESTION MARK
be renamed to "HELP CHARACTER", or "CONVERSION SUBSTITUTION GLYPH", and so on.

Basically, these suggestions for emendations of character names to
reflect functions constitute a black hole -- any amount of effort
expended in this direction will simply be swallowed up, marginally
increasing the gravity of the black hole, but will otherwise never be
heard from again.

>
> The best name would be MINUS SIGN. Unfortunately, U+2212 is already
> called like that. But this character could be renamed to what it is, a
> PRINTABLE MINUS SIGN.

All of the minus signs are printable, so I don't see that this is
much help in making distinctions. The true distinctions between
some functionally different characters are carried by the *properties*
of the characters:

002D;HYPHEN-MINUS;Pd;0;ET;;;;;N;;;;;
2212;MINUS SIGN;Sm;0;ET;;;;;N;;;;;

Pd = dash subcategory of punctuation
Sm = math subcategory of symbol (i.e. a math operator)

Further detailed specification of character properties is the way
the Unicode Technical Committee clarifies distinctions between usages
of characters -- not by fiddling with their names.
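
The property distinction can be read programmatically. This is a
minimal sketch, assuming Python 3 and its standard unicodedata module,
which reflects whatever version of the Unicode Character Database the
interpreter ships with. The helper name describe is made up for this
example.

```python
import unicodedata

def describe(ch):
    """Return (code point, normative name, general category) for ch."""
    return (f"U+{ord(ch):04X}", unicodedata.name(ch), unicodedata.category(ch))

# HYPHEN-MINUS, HYPHEN, and MINUS SIGN look alike but differ in category.
for ch in ("\u002D", "\u2010", "\u2212"):
    print(*describe(ch))
```

A parser or text processor can branch on the category (Pd versus Sm)
without ever looking at the character's name.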

--Ken

>
>
> PRINTABLE
>
> To avoid double names for characters, a prefix like PRINTABLE could be
> used for look-alikes of other functional characters, esp. in additional
> names.
>
>
> Copyleft 2000 by Bernd Warken <bwarken@mayn.de>
>


