L2/05-096

Known Anomalies in
Unicode Character Names

Version	1
Authors	Rick McGowan and Ken Whistler
Date	April 22, 2005
This Version	n/a
Previous Version	n/a
Latest Version	n/a

Summary

This document provides a complete listing of all known typographical errors in Unicode character names. It also includes a selection of misnamed characters, misleading names, and other problems with character names. Due to stability policies, such errors in names cannot be corrected, but they can cause difficulties for users and implementers.

Status

This document is a draft. It is supplied purely for informational purposes to the Unicode Technical Committee.

1 Introduction
2 List of Known Anomalies and Their Explanations
Appendix A: Zarqa and Zinor
Acknowledgements
References
Modifications

1 Introduction

Due to stability policies, errors in character names cannot be corrected. This policy dates from a famous ruckus in the early 1990s over the name of æ and Æ:

1993-07-08

Denmark is issuing this defect report to ISO 10646-1:1993 based on the naming of Danish, Faroese and Greenlandic letter "Æ" in upper and lower case and with acute accent. The character "Æ" is also used as letter in the Norwegian and Icelandic languages. Please find enclosed an official statement from the Danish Standards Association concerning the Danish letter "Æ". During the process of writing the ISO 10646-1:1993, the naming was correct - for example "LATIN CAPITAL LETTER AE" - in the second DIS. It was changed to "LATIN CAPITAL LIGATURE AE" in the final version of the ISO 10646-1 (1993). ...

This defect report took over two years to resolve, with Francophones and Scandinavians at loggerheads every step of the way, until DCOR No. 1 to 10646-1:1993 was published in 1996.

The Unicode Standard, being synchronized with 10646, was dragged along in this process:

Unicode 1.0:

U+00E6 LATIN SMALL LETTER A E
    = LATIN SMALL LETTER AE
    (the name in ISO 8859-1)

Unicode 1.1:

U+00E6 LATIN SMALL LIGATURE AE
    (synchronized with 10646-1:1993)
    = LATIN SMALL LETTER A E

Unicode 2.0:

U+00E6 LATIN SMALL LETTER AE
    (applied DCOR No. 1 to 10646-1:1993)
    = LATIN SMALL LIGATURE AE

The confusion this situation left in all of the standards documents from the 1993-1996 period had no beneficial consequences for implementations of æ and Æ. It left bitter feelings and led both committees to decide that past a certain point such defect reports would be noted but not acted upon, insofar as they were requests for changes in names of published characters in the standards.

The stability of published character names is far more important to the network of interdependent standards that refer to character encoding standards than is the correctness of the name.

People who are using the Unicode Standard must understand that it is a character encoding standard. It is not the Universal Encyclopedia of Writing Systems and Character Identity.

Unicode character names are normative for the purposes of the character encoding standard and those other IT standards that reference it. They are also immutable, by action of both SC2 and the UTC, because change of character names is almost as disruptive of the standards as changing code points for characters would be.

This does not mean that the Unicode Standard is dictating to anyone what the name of some letter in their writing system should properly be, whether in English or in any other language.

This should be obvious from ASCII characters, which have a long history of this kind of concern, well predating Unicode's involvement in character encoding. Take U+002F SOLIDUS. Not one American English speaker in 100,000 would call '/' a "solidus". Its name is "slash" or for older speakers, perhaps "slanted bar", and so forth. Use the term "solidus" and everyone will look blankly at you, except Classics professors wondering what Roman money has to do with it, or character encoding mavens, who know the term because they read ASCII code charts.

In some cases, annotations have been added to the names list in the Unicode Standard to document various problems, but to date there has been no full listing of all known problems. One of the reasons the Unicode Standard publishes many aliases in the Unicode names list is that there often are much better, more communicative names for particular characters, even in English, than the normative names in the data file. Users are free to use such aliases and other names, as long as they are not misrepresented as corrections to the standard, but are instead used as alternative, more useful names for characters in the standard.
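The distinction between a normative name and a more communicative alias can be sketched in a few lines of Python. This is only an illustration: the standard `unicodedata` module supplies the normative names, while the table `INFORMAL_NAMES` and the function `describe` are hypothetical, application-level inventions and not part of the standard or of any library.

```python
import unicodedata

# Hypothetical application-level aliases (an assumption for illustration).
# They are kept alongside the normative names, never presented as
# corrections to the standard.
INFORMAL_NAMES = {
    "/": "slash",
    "\u02c7": "hacek",
}


def describe(ch: str) -> str:
    """Return the normative name, plus an informal alias if one is known."""
    normative = unicodedata.name(ch)
    informal = INFORMAL_NAMES.get(ch)
    if informal is None:
        return normative
    return f"{normative} (informally: {informal})"


for ch in ("/", "\u02c7", "\u00e6"):
    print(f"U+{ord(ch):04X} {describe(ch)}")
```

Keeping the alias table separate from the normative name means that anything exchanged with other systems can still use the immutable name, while users see the friendlier one.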

2 List of Known Anomalies and Their Explanations

This section lists known problems with Unicode character names.

U+0149 LATIN SMALL LETTER N PRECEDED BY APOSTROPHE

This is not actually a single letter.

U+01A2 LATIN CAPITAL LETTER OI
U+01A3 LATIN SMALL LETTER OI

Should have been called GHA.

U+01BE LATIN LETTER INVERTED GLOTTAL STOP WITH STROKE

This is actually based on a ligation of "ts", not an inverted glottal stop.

U+0238 LATIN SMALL LETTER DB DIGRAPH
U+0239 LATIN SMALL LETTER QP DIGRAPH

These are actually ligatures, rather than digraphs.

U+025B LATIN SMALL LETTER OPEN E

This is a Latin epsilon. Not calling it thus was just dumb.

U+025E LATIN SMALL LETTER CLOSED REVERSED OPEN E

Actually CLOSED REVERSED EPSILON. This is the reason why not calling U+025B "epsilon" was dumb.

U+0285 LATIN SMALL LETTER SQUAT REVERSED ESH

Actually LATIN SMALL LETTER REVERSED FISHHOOK R WITH RETROFLEX HOOK.

U+02C7 CARON
U+030C COMBINING CARON

Should have been called HACEK and COMBINING HACEK. The term "caron" appears to be an invention of some standards body, but the term is so heinously wrong, nobody will admit to having coined it and its origin may be lost in the mists of time.

U+034F COMBINING GRAPHEME JOINER

Despite its name, it does not join graphemes.

U+0598 HEBREW ACCENT ZARQA

Perhaps should have been called HEBREW ACCENT TSINNORIT. May also be used for zarqa when shown on an accented non-final letter. See Appendix A.

U+05AE HEBREW ACCENT ZINOR

Should have been called HEBREW ACCENT ZARQA (= tsinor). See Appendix A.

U+0670 ARABIC LETTER SUPERSCRIPT ALEF

Not an Arabic letter, but a vowel sign.

U+06C0 ARABIC LETTER HEH WITH YEH ABOVE
U+06C2 ARABIC LETTER HEH GOAL WITH HAMZA ABOVE
U+06D3 ARABIC LETTER YEH BARREE WITH HAMZA ABOVE

These three are actually ligatures, not letters.

U+0B83 TAMIL SIGN VISARGA

Actually TAMIL AAYTHAM (and there is disagreement about its status as a visarga).

U+0CDE KANNADA LETTER FA

Should have been called KANNADA LETTER LLLA. It has nothing to do with "Fa".

U+0F0A TIBETAN MARK BKA- SHOG YIG MGO

This character is used to indicate that a document is addressed to a superior (the "petition honorific"), but the Tibetan name actually indicates a superior addressing an inferior ("starting flourish for giving a command").

U+0F0B TIBETAN MARK INTERSYLLABIC TSHEG

The tsheg mark is not restricted to intersyllabic usage, and would have been better named "TIBETAN MARK TSHEG".

U+0F0C TIBETAN MARK DELIMITER TSHEG BSTAR

This character is not a delimiter, but is a non-breaking version of the tsheg mark (U+0F0B) that is used exclusively between the letter NGA (U+0F44) and the shad mark (U+0F0D).

U+0FD0 TIBETAN MARK BSKA- SHOG GI MGO RGYAN

The syllable "BSKA-" does not occur naturally in Tibetan, and is a mistake for "BKA-" (cf. U+0F0A).

U+156F CANADIAN SYLLABICS TTH

Canadian Syllabics mistake for a pair of asterisks.

U+200B ZERO WIDTH SPACE

This isn't a "space".

U+2118 SCRIPT CAPITAL P

Should have been called CALLIGRAPHIC SMALL P or perhaps even WEIERSTRASS ELLIPTIC FUNCTION SYMBOL, which is what it is used for. It's not a capital "P" at all.

U+262B FARSI SYMBOL

SYMBOL OF IRAN (It was renamed because of the taboo on national symbols in ISO standards.)

U+327C CIRCLED KOREAN CHARACTER CHAMKO
U+327D CIRCLED KOREAN CHARACTER JUEUI

An instance of inconsistent transliterations, resulting from unreconciled North/South Korean positions.

U+A015 YI SYLLABLE WU

This is not a syllable pronounced "wu", but is actually a syllable iteration mark.

U+FE18 PRESENTATION FORM FOR VERTICAL RIGHT WHITE LENTICULAR BRAKCET

A spelling error: "brakcet" should be "bracket".

U+FEFF ZERO WIDTH NO-BREAK SPACE

Byte Order Mark (Naming it ZWNBSP was a mistake from the start.)

U+1D0C5 BYZANTINE MUSICAL SYMBOL FHTORA SKLIRON CHROMA VASIS

A spelling error: "fhtora" should be "fthora".
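Several of the entries above can be checked against the character data that ships with a programming language. The following is a minimal sketch using Python's standard `unicodedata` module; the results depend on the Unicode version bundled with the Python build, and the helper `summarize` is a hypothetical name introduced only for this example. The general category codes it prints (such as Mn for a nonspacing mark or Cf for a format character) come from the Unicode Character Database, not from this document.

```python
import unicodedata


def summarize(ch: str) -> tuple:
    """Return (normative name, general category) for one character."""
    return unicodedata.name(ch), unicodedata.category(ch)


# The misspelled names are still the normative names.
print(unicodedata.name("\ufe18"))
print(unicodedata.name("\U0001d0c5"))

# Names that say LETTER or SPACE, shown next to the general category
# that the character database assigns to the same characters.
for ch in ("\u0670", "\u200b", "\ufeff"):
    name, category = summarize(ch)
    print(f"U+{ord(ch):04X} {name}: general category {category}")

# U+0149 carries a compatibility decomposition into two characters.
print(unicodedata.decomposition("\u0149"))
```

Looking at properties rather than names is the safer habit for implementers: the name is only an identifier, and, as the list shows, it may say little about how the character behaves.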

Appendix A: Zarqa and Zinor

OK, here's the rundown: There are two separate cantillation systems in the Hebrew Bible. One is used for Psalms, Proverbs and (most of) Job, (the "poetic" books, hence the "poetic system"), and one is used everywhere else. The two systems have structural similarities and share *some* graphemes, but not all. We are here dealing with accents with roughly the same shape (in modern printing; old manuscripts actually had them written slightly differently). In the prose system there is an accent called ZARQA, which is postposed (on or to the left of the last letter), and in the poetic system there is one called TSINOR (and also zarqa and vice-versa; each of these has many names) which has the same shape and placement and even an analogous function in the structure of the cantillations. There is another accent, only in the poetic system, called the TSINNORIT (a diminutive of tsinor), which occurs directly above its letter, and is (almost?) never on the last letter of its word. (More modern printing tends to put the zarqa right on top of its letter too, but that's just a printing preference). If you look closely at some old manuscripts, you can tell that tsinnorit has a slightly different shape than zarqa/tsinor.

In Unicode terms, we have ZARQA (U+0598) and ZINOR (U+05AE) [sic]. By the usual meanings of those names, those should properly be synonyms, the same accent, but they're not. Worse, even if we work with the fact that "zinor" is at least mnemonic of "tsinnorit," it's still the wrong way 'round: ZINOR has the combining class of above-postposed, and ZARQA is encoded to go directly above the letter. So, to encode a zarqa or a tsinor, you need to use ZINOR, and to encode a tsinnorit, you need to use ZARQA. Oooops. That's less than obvious.

-- Mark Shoulson
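The mismatch described above can be observed directly in the canonical combining classes of the two characters. This is a minimal sketch using Python's standard `unicodedata` module, and the values it prints depend on the Unicode data bundled with the Python build. In the Unicode Character Database, class 230 means "above" and class 228 means "above left"; reading "above left" as the above-postposed position for right-to-left Hebrew text is an interpretation made here for the example.

```python
import unicodedata

# Print the name and canonical combining class of each accent.
for ch in ("\u0598", "\u05ae"):
    name = unicodedata.name(ch)
    ccc = unicodedata.combining(ch)
    print(f"U+{ord(ch):04X} {name}: combining class {ccc}")
```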

Acknowledgements

Thanks to Mark Shoulson, John Hudson, James Kass, and Andrew West for their contributions.

References

[10646]	International Organization for Standardization. Information Technology— Universal Multiple-Octet Coded Character Set (UCS). (ISO/IEC 10646:2003). For availability see http://www.iso.org/
[FAQ]	Unicode Frequently Asked Questions http://www.unicode.org/faq/ For answers to common questions on technical issues.
[Feedback]	http://www.unicode.org/reporting.html For reporting errors and requesting information online.
[Glossary]	Unicode Glossary http://www.unicode.org/glossary/ For explanations of terminology used in this and other documents.
[Names]	Unicode Names List Documentation and Format http://www.unicode.org/Public/UNIDATA/NamesList.html For the latest data file see http://www.unicode.org/Public/UNIDATA/NamesList.txt
[Reports]	Unicode Technical Reports http://www.unicode.org/reports/ For information on the status and development process for technical reports, and for a list of technical reports.
[Stability]	Unicode Standard Stability Policy http://www.unicode.org/standard/stability_policy.html
[UCD]	Unicode Character Database http://www.unicode.org/ucd/ For an overview of the Unicode Character Database and a list of its associated files see http://www.unicode.org/Public/UNIDATA/UCD.html
[Unicode]	The Unicode Standard, Version 4.0, (Reading, Massachusetts: Addison-Wesley Developers Press 2003, ISBN 0-321-18578-1) or online as http://www.unicode.org/versions/Unicode4.0.0/
[Versions]	Versions of the Unicode Standard http://www.unicode.org/versions/ For details on the precise contents of each version of the Unicode Standard, and how to cite them.

Modifications

The following summarizes modifications from the previous version of this document.

1	Initial version.

Copyright © 2005 Rick McGowan, Ken Whistler, and Unicode, Inc. All Rights Reserved. The Unicode Consortium and Rick McGowan & Ken Whistler make no expressed or implied warranty of any kind, and assume no liability for errors or omissions. No liability is assumed for incidental and consequential damages in connection with or arising out of the use of the information or programs contained or accompanying this technical note. The Unicode Terms of Use apply.

Unicode and the Unicode logo are trademarks of Unicode, Inc., and are registered in some jurisdictions.