
L2/05-096

Known Anomalies in
Unicode Character Names

Version: 1
Authors: Rick McGowan and Ken Whistler
Date: April 22, 2005
This Version: n/a
Previous Version: n/a
Latest Version: n/a


Summary

This document provides a complete listing of all known typographical errors in Unicode character names. It also includes a selection of misnamed characters, misleading names, and other problems with character names. Due to stability policies, such errors in names cannot be corrected, but they can cause difficulties for users and implementers.

Status

This document is a draft. It is supplied purely for informational purposes to the Unicode Technical Committee.

1 Introduction

Due to stability policies, errors in character names cannot be corrected. This policy dates from a famous ruckus in the early 1990s over the names of æ and Æ:

1993-07-08

Denmark is issuing this defect report to ISO 10646-1:1993 based on the naming of Danish, Faroese and Greenlandic letter "Æ" in upper and lower case and with acute accent. The character "Æ" is also used as letter in the Norwegian and Icelandic languages. Please find enclosed an official statement from the Danish Standards Association concerning the Danish letter "Æ". During the process of writing the ISO 10646-1:1993, the naming was correct - for example "LATIN CAPITAL LETTER AE" - in the second DIS. It was changed to "LATIN CAPITAL LIGATURE AE" in the final version of the ISO 10646-1 (1993). ...

This defect report took over two years to resolve, with Francophones and Scandinavians at loggerheads every step of the way, until DCOR No. 1 to 10646-1:1993 was published in 1996.

The Unicode Standard, being synchronized with 10646, was dragged along in this process:

Unicode 1.0:

U+00E6 LATIN SMALL LETTER A E
    = LATIN SMALL LETTER AE
    (the name in ISO 8859-1)

Unicode 1.1:

U+00E6 LATIN SMALL LIGATURE AE
    (synchronized with 10646-1:1993)
    = LATIN SMALL LETTER A E

Unicode 2.0:

U+00E6 LATIN SMALL LETTER AE
    (applied DCOR No. 1 to 10646-1:1993)
    = LATIN SMALL LIGATURE AE

The confusion this situation left in all of the standards documents from the 1993-1996 period had no beneficial consequences for implementations of æ and Æ. It left bitter feelings and led both committees to decide that, past a certain point, such defect reports would be noted but not acted upon, insofar as they were requests for changes in the names of published characters in the standards.

The stability of published character names is far more important to the network of interdependent standards that refer to character encoding standards than is the correctness of the name.

People who are using the Unicode Standard must understand that it is a character encoding standard. It is not the Universal Encyclopedia of Writing Systems and Character Identity.

Unicode character names are normative for the purposes of the character encoding standard and those other IT standards that reference it. They are also immutable, by action of both SC2 and the UTC, because changing character names is almost as disruptive to the standards as changing the code points of characters would be.

This does not mean that the Unicode Standard is dictating to anyone what the name of some letter in their writing system should properly be, whether in English or in any other language.

This should be obvious from ASCII characters, which have a long history of this kind of concern, well predating Unicode's involvement in character encoding. Take U+002F SOLIDUS. Not one American English speaker in 100,000 would call '/' a "solidus". Its name is "slash" or, for older speakers, perhaps "slanted bar", and so forth. Use the term "solidus" and everyone will look blankly at you, except Classics professors wondering what Roman money has to do with it, or character encoding mavens, who know the term because they read ASCII code charts.

In some cases, annotations have been added to the names list in the Unicode Standard to document various problems, but to date there has been no full listing of all known problems. One reason the Unicode Standard publishes many aliases in the Unicode names list is that there are often much better, more communicative names for particular characters, even in English, than the normative names in the data file. Users are free to use such aliases and other names, as long as they are not misrepresented as corrections to the standard but are instead used as alternative, more useful names for characters in the standard.
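As a rough illustration of keeping the normative name while also showing a friendlier one, the following Python sketch pairs the name reported by the standard library's `unicodedata` module with an application-level alias. The `DISPLAY_ALIASES` table and the `describe` function are hypothetical names invented for this example; the alias strings are illustrative and are not taken from any Unicode data file.

```python
import unicodedata

# Hypothetical application-level alias table (an assumption of this sketch).
# The entries are alternative display names, not corrections to the standard.
DISPLAY_ALIASES = {
    0x002F: "slash",
    0xFE18: "PRESENTATION FORM FOR VERTICAL RIGHT WHITE LENTICULAR BRACKET",
}


def describe(cp: int) -> str:
    """Return the normative name, plus a display alias when one is known."""
    normative = unicodedata.name(chr(cp))
    alias = DISPLAY_ALIASES.get(cp)
    if alias is None:
        return f"U+{cp:04X} {normative}"
    return f"U+{cp:04X} {normative} (also known as: {alias})"


print(describe(0x002F))
print(describe(0xFE18))
```

The normative name stays the identifier that other standards and data files can rely on; the alias is presented only as an additional label.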

2 List of Known Anomalies and Their Explanations

This section lists known problems with Unicode character names.

U+0149 LATIN SMALL LETTER N PRECEDED BY APOSTROPHE

U+01A2 LATIN CAPITAL LETTER OI
U+01A3 LATIN SMALL LETTER OI

U+01BE LATIN LETTER INVERTED GLOTTAL STOP WITH STROKE

U+0238 LATIN SMALL LETTER DB DIGRAPH
U+0239 LATIN SMALL LETTER QP DIGRAPH

U+025B LATIN SMALL LETTER OPEN E

U+025E LATIN SMALL LETTER CLOSED REVERSED OPEN E

U+0285 LATIN SMALL LETTER SQUAT REVERSED ESH

U+02C7 CARON
U+030C COMBINING CARON

U+034F COMBINING GRAPHEME JOINER

U+0598 HEBREW ACCENT ZARQA

U+05AE HEBREW ACCENT ZINOR

U+0670 ARABIC LETTER SUPERSCRIPT ALEF

U+06C0 ARABIC LETTER HEH WITH YEH ABOVE
U+06C2 ARABIC LETTER HEH GOAL WITH HAMZA ABOVE
U+06D3 ARABIC LETTER YEH BARREE WITH HAMZA ABOVE

U+0B83 TAMIL SIGN VISARGA

U+0CDE KANNADA LETTER FA

U+0F0A TIBETAN MARK BKA- SHOG YIG MGO

U+0F0B TIBETAN MARK INTERSYLLABIC TSHEG

U+0F0C TIBETAN MARK DELIMITER TSHEG BSTAR

U+0FD0 TIBETAN MARK BSKA- SHOG GI MGO RGYAN

U+156F CANADIAN SYLLABICS TTH

U+200B ZERO WIDTH SPACE

U+2118 SCRIPT CAPITAL P

U+262B FARSI SYMBOL

U+327C CIRCLED KOREAN CHARACTER CHAMKO
U+327D CIRCLED KOREAN CHARACTER JUEUI

U+A015 YI SYLLABLE WU

U+FE18 PRESENTATION FORM FOR VERTICAL RIGHT WHITE LENTICULAR BRAKCET

U+FEFF ZERO WIDTH NO-BREAK SPACE

U+1D0C5 BYZANTINE MUSICAL SYMBOL FHTORA SKLIRON CHROMA VASIS
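Because the names above are immutable, software that looks characters up by name has to use them exactly as published, typographical errors included. The sketch below, which assumes Python's standard `unicodedata` module and uses a hypothetical helper named `mismatches`, checks a few entries copied from the list against the library's data and performs a reverse lookup by the misspelled name.

```python
import unicodedata

# A few (code point, normative name) pairs copied from the list above.
LISTED = [
    (0x01A2, "LATIN CAPITAL LETTER OI"),
    (0xFE18, "PRESENTATION FORM FOR VERTICAL RIGHT WHITE LENTICULAR BRAKCET"),
    (0x1D0C5, "BYZANTINE MUSICAL SYMBOL FHTORA SKLIRON CHROMA VASIS"),
]


def mismatches(entries):
    """Return the entries whose listed name differs from the library's name."""
    return [
        (cp, name)
        for cp, name in entries
        if unicodedata.name(chr(cp), "") != name
    ]


# An empty list means every listed name matches the library's data.
print(mismatches(LISTED))

# Reverse lookup works with the published (misspelled) name.
bracket = unicodedata.lookup(
    "PRESENTATION FORM FOR VERTICAL RIGHT WHITE LENTICULAR BRAKCET"
)
print(f"U+{ord(bracket):04X}")
```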

 

Appendix A: Zarqa and Zinor

OK, here's the rundown: There are two separate cantillation systems in the Hebrew Bible. One is used for Psalms, Proverbs and (most of) Job, (the "poetic" books, hence the "poetic system"), and one is used everywhere else. The two systems have structural similarities and share *some* graphemes, but not all. We are here dealing with accents with roughly the same shape (in modern printing; old manuscripts actually had them written slightly differently). In the prose system there is an accent called ZARQA, which is postposed (on or to the left of the last letter), and in the poetic system there is one called TSINOR (and also zarqa and vice-versa; each of these has many names) which has the same shape and placement and even an analogous function in the structure of the cantillations. There is another accent, only in the poetic system, called the TSINNORIT (a diminutive of tsinor), which occurs directly above its letter, and is (almost?) never on the last letter of its word. (More modern printing tends to put the zarqa right on top of its letter too, but that's just a printing preference). If you look closely at some old manuscripts, you can tell that tsinnorit has a slightly different shape than zarqa/tsinor.

In Unicode terms, we have ZARQA (U+0598) and ZINOR (U+05AE) [sic]. By the usual meanings of those names, those should properly be synonyms, the same accent, but they're not. Worse, even if we work with the fact that "zinor" is at least mnemonic of "tsinnorit," it's still the wrong way 'round: ZINOR has the combining class of above-postposed, and ZARQA is encoded to go directly above the letter. So, to encode a zarqa or a tsinor, you need to use ZINOR, and to encode a tsinnorit, you need to use ZARQA. Oooops. That's less than obvious.

-- Mark Shoulson
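The mismatch described above can be seen in the character properties themselves. The following sketch, assuming Python's standard `unicodedata` module, prints the canonical combining class of each accent. In the Unicode Character Database, class 230 means "above" and class 228 means "above left", which in right-to-left Hebrew text corresponds to the postposed position.

```python
import unicodedata

ZARQA = "\u0598"  # HEBREW ACCENT ZARQA
ZINOR = "\u05AE"  # HEBREW ACCENT ZINOR

for ch in (ZARQA, ZINOR):
    name = unicodedata.name(ch)
    ccc = unicodedata.combining(ch)  # canonical combining class
    print(f"U+{ord(ch):04X} {name} ccc={ccc}")
```

So the character named ZARQA carries the "above" class that suits a tsinnorit, while the character named ZINOR carries the postposed class that suits a zarqa or tsinor.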

Acknowledgements

Thanks to Mark Shoulson, John Hudson, James Kass, and Andrew West for their contributions.

References

[10646] International Organization for Standardization. Information Technology - Universal Multiple-Octet Coded Character Set (UCS). (ISO/IEC 10646:2003).
For availability see http://www.iso.org/
[FAQ] Unicode Frequently Asked Questions
http://www.unicode.org/faq/
For answers to common questions on technical issues.
[Feedback] http://www.unicode.org/reporting.html
For reporting errors and requesting information online.
[Glossary] Unicode Glossary
http://www.unicode.org/glossary/
For explanations of terminology used in this and other documents.
[Names] Unicode Names List Documentation and Format
http://www.unicode.org/Public/UNIDATA/NamesList.html
For the latest data file see http://www.unicode.org/Public/UNIDATA/NamesList.txt
[Reports] Unicode Technical Reports
http://www.unicode.org/reports/
For information on the status and development process for technical reports, and for a list of technical reports.
[Stability] Unicode Standard Stability Policy
http://www.unicode.org/standard/stability_policy.html
For the policies governing the stability of the Unicode Standard, including the immutability of character names.
[UCD] Unicode Character Database
http://www.unicode.org/ucd/
For an overview of the Unicode Character Database and a list of its associated files see http://www.unicode.org/Public/UNIDATA/UCD.html
[Unicode] The Unicode Standard, Version 4.0 (Reading, Massachusetts: Addison-Wesley Developers Press, 2003, ISBN 0-321-18578-1), or online at http://www.unicode.org/versions/Unicode4.0.0/
[Versions] Versions of the Unicode Standard
http://www.unicode.org/versions/
For details on the precise contents of each version of the Unicode Standard, and how to cite them.


Modifications

The following summarizes modifications from the previous version of this document.

1 Initial version.