
Unicode Technical Note #27

Known Anomalies in
Unicode Character Names

Version 4
Authors Asmus Freytag, Rick McGowan, and Ken Whistler
Date April 10, 2017
This Version http://www.unicode.org/notes/tn27/tn27-4.html
Previous Version http://www.unicode.org/notes/tn27/tn27-3.html
Latest Version http://www.unicode.org/notes/tn27/

Summary

This document provides information on many known anomalies in the formal character names in the Unicode Standard.

Status

This document is a Unicode Technical Note. Sole responsibility for its contents rests with the author(s). Publication does not imply any endorsement by the Unicode Consortium. This document is not subject to the Unicode Patent Policy.

For information on Unicode Technical Notes including criteria for acceptance, see http://www.unicode.org/notes/.


Introduction

In this document we list all Unicode character names with known clerical errors in the spelling of their names at the time of its writing. In addition, we have compiled information on many misnamed characters, misleading character names, and characters with other known problems with their names.

Because the Unicode Standard is a character encoding standard and not the Universal Encyclopedia of Writing Systems and Character Identity, the stability and uniqueness of published character names are far more important than the correctness of any given name. The published character names are normative for the purposes of the Unicode Standard and the large number of other IT standards that reference it. These standards require stable identifiers, so character names must be immutable: any change to a character name is almost as disruptive to the standards as changing the code points of characters would be. Accordingly, the Unicode Consortium has adopted the Name Stability Policy, which prevents changes to character names. As a result, errors in character names cannot be corrected. Instead, important character name anomalies are documented with annotations in the Unicode Character Code Charts.

The requirement for a unique and stable character name that can be used as a formal identifier does not mean that the Unicode Standard dictates to anyone what the name of any given letter in their writing system should properly be, whether in English or in any other language. The Unicode Code Charts provide informative aliases for a large number of characters, the names of which are not anomalous or defective. This is because different user communities often use different names for the same character, even in English.

One reason the Unicode Standard publishes many informative aliases in the Unicode names list is that there are often much better, more communicative names for particular characters, even in English, than the normative names in the data file. For example, U+002F SOLIDUS is more widely known among its American users as slash. Informal aliases are useful in describing a character, but cannot be used as identifiers, because they are not guaranteed to be unique or stable. Users are free to use such aliases and other names, as long as they are not misrepresented as corrections to the standard, but are instead used as alternative, more useful names for characters in the standard.

For character names that were encoded with misspelled words as part of their name, or that exhibit other serious errors, the Unicode Standard has adopted normative character name aliases. These aliases can be used as an alternative, normative identifier for the character without the need to preserve the original spelling or other error in the character name. While this means that some characters can have more than one identifier, each identifier continues to refer uniquely to a single character. Formal aliases are documented in the NameAliases.txt file in the Unicode Character Database. Formal name aliases are also documented in the Unicode Code Charts. We have not documented them here; instead, we merely indicate for which characters formal aliases exist at the time of this writing.
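As an illustration only, the relationship between a normative name and its formal alias can be observed with Python's standard unicodedata module, whose lookup() function has accepted formal name aliases since Python 3.3. The sketch below uses U+FE18, whose normative name contains the misspelling BRAKCET. The corrected spelling shown is assumed to be the formal alias recorded in NameAliases.txt, and the results depend on the Unicode version bundled with the Python interpreter in use.

```python
import unicodedata

# Normative name of U+FE18, including its original misspelling ("BRAKCET").
misspelled = "PRESENTATION FORM FOR VERTICAL RIGHT WHITE LENTICULAR BRAKCET"
# Assumed formal alias (corrected spelling) from NameAliases.txt.
corrected = "PRESENTATION FORM FOR VERTICAL RIGHT WHITE LENTICULAR BRACKET"

ch = "\uFE18"

# name() reports only the immutable normative name.
normative = unicodedata.name(ch)

# lookup() accepts either identifier; both resolve to the same character.
via_name = unicodedata.lookup(misspelled)
via_alias = unicodedata.lookup(corrected)

print(normative)
print(f"U+{ord(via_name):04X}", f"U+{ord(via_alias):04X}")

# An informal name such as "slash" is not an identifier, so lookup fails.
try:
    unicodedata.lookup("SLASH")
    slash_is_identifier = True
except KeyError:
    slash_is_identifier = False
print("SLASH accepted as identifier:", slash_is_identifier)
```

Both identifiers resolve to the same code point, while the informal name is rejected, which matches the distinction drawn above between formal aliases and informative ones.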

In some cases, annotations have been added to the names list in the Unicode Standard to document various lesser problems, but to date there has been no full listing of all known problems.

The authors therefore intend this Technical Note to serve as a convenient summary of the information about character name anomalies in the Unicode Standard at the time of its writing. It will be updated from time to time as additional anomalies become known. While the information in this technical note is based on information published in the Unicode Standard, the selection and manner of presentation in this document reflect choices made by its authors; it does not in any way supersede the information in the Unicode Standard.

List of Known Anomalies and Explanations

This section lists character names with known anomalies, including those for which a formal alias has been defined. It provides further information about some names that have been the objects of discussion or inquiry. As issues are reported, additional entries may be added at any time and without notice. While many of the explanations below are based on annotations in the Unicode Code Charts, they have been edited or restated by the authors.

U+0149 LATIN SMALL LETTER N PRECEDED BY APOSTROPHE

U+01A2 LATIN CAPITAL LETTER OI
U+01A3 LATIN SMALL LETTER OI

U+01BE LATIN LETTER INVERTED GLOTTAL STOP WITH STROKE

U+0238 LATIN SMALL LETTER DB DIGRAPH
U+0239 LATIN SMALL LETTER QP DIGRAPH

U+025B LATIN SMALL LETTER OPEN E

U+025E LATIN SMALL LETTER CLOSED REVERSED OPEN E

U+0285 LATIN SMALL LETTER SQUAT REVERSED ESH

U+02C7 CARON
U+030C COMBINING CARON

U+034F COMBINING GRAPHEME JOINER

U+039B GREEK CAPITAL LETTER LAMDA
U+03BB GREEK SMALL LETTER LAMDA

U+04A5 CYRILLIC SMALL LIGATURE EN GHE
U+04B5 CYRILLIC SMALL LIGATURE TE TSE
U+04D5 CYRILLIC SMALL LIGATURE A IE

U+0598 HEBREW ACCENT ZARQA

U+05AE HEBREW ACCENT ZINOR

U+0670 ARABIC LETTER SUPERSCRIPT ALEF

U+06C0 ARABIC LETTER HEH WITH YEH ABOVE
U+06C2 ARABIC LETTER HEH GOAL WITH HAMZA ABOVE
U+06D3 ARABIC LETTER YEH BARREE WITH HAMZA ABOVE

U+0709 SYRIAC SUBLINEAR COLON SKEWED RIGHT

U+0964 DEVANAGARI DANDA
U+0965 DEVANAGARI DOUBLE DANDA

U+0A01 GURMUKHI SIGN ADAK BINDI

U+0B83 TAMIL SIGN VISARGA

U+0CDE KANNADA LETTER FA

U+0E9D LAO LETTER FO TAM

U+0E9F LAO LETTER FO SUNG

U+0EA3 LAO LETTER LO LING

U+0EA5 LAO LETTER LO LOOT

U+0F0A TIBETAN MARK BKA- SHOG YIG MGO

U+0F0B TIBETAN MARK INTERSYLLABIC TSHEG

U+0F0C TIBETAN MARK DELIMITER TSHEG BSTAR

U+0FD0 TIBETAN MARK BSKA- SHOG GI MGO RGYAN

U+11EC HANGUL JONGSEONG IEUNG-KIYEOK
U+11ED HANGUL JONGSEONG IEUNG-SSANGKIYEOK
U+11EE HANGUL JONGSEONG SSANGIEUNG
U+11EF HANGUL JONGSEONG IEUNG-KHIEUKH

U+156F CANADIAN SYLLABICS TTH

U+178E KHMER LETTER NNO

U+179E KHMER LETTER SSO

U+200B ZERO WIDTH SPACE

U+2113 SCRIPT SMALL L

U+2118 SCRIPT CAPITAL P

U+234A APL FUNCTIONAL SYMBOL DOWN TACK UNDERBAR
U+234E APL FUNCTIONAL SYMBOL DOWN TACK JOT
U+2351 APL FUNCTIONAL SYMBOL UP TACK OVERBAR
U+2355 APL FUNCTIONAL SYMBOL UP TACK JOT
U+2361 APL FUNCTIONAL SYMBOL UP TACK DIAERESIS

U+2448 OCR DASH
U+2449 OCR CUSTOMER ACCOUNT NUMBER

U+2629 CROSS OF JERUSALEM

U+262B FARSI SYMBOL

U+2B7A LEFTWARDS TRIANGLE-HEADED ARROW WITH DOUBLE HORIZONTAL STROKE
U+2B7C RIGHTWARDS TRIANGLE-HEADED ARROW WITH DOUBLE HORIZONTAL STROKE

U+3021 HANGZHOU NUMERAL ONE
U+3022 HANGZHOU NUMERAL TWO
U+3023 HANGZHOU NUMERAL THREE
U+3024 HANGZHOU NUMERAL FOUR
U+3025 HANGZHOU NUMERAL FIVE
U+3026 HANGZHOU NUMERAL SIX
U+3027 HANGZHOU NUMERAL SEVEN
U+3028 HANGZHOU NUMERAL EIGHT
U+3029 HANGZHOU NUMERAL NINE

U+327C CIRCLED KOREAN CHARACTER CHAMKO
U+327D CIRCLED KOREAN CHARACTER JUEUI

U+A015 YI SYLLABLE WU

U+FA0E CJK COMPATIBILITY IDEOGRAPH-FA0E
U+FA0F CJK COMPATIBILITY IDEOGRAPH-FA0F
U+FA11 CJK COMPATIBILITY IDEOGRAPH-FA11
U+FA13 CJK COMPATIBILITY IDEOGRAPH-FA13
U+FA14 CJK COMPATIBILITY IDEOGRAPH-FA14
U+FA1F CJK COMPATIBILITY IDEOGRAPH-FA1F
U+FA21 CJK COMPATIBILITY IDEOGRAPH-FA21
U+FA23 CJK COMPATIBILITY IDEOGRAPH-FA23
U+FA24 CJK COMPATIBILITY IDEOGRAPH-FA24
U+FA27 CJK COMPATIBILITY IDEOGRAPH-FA27
U+FA28 CJK COMPATIBILITY IDEOGRAPH-FA28
U+FA29 CJK COMPATIBILITY IDEOGRAPH-FA29

U+FE18 PRESENTATION FORM FOR VERTICAL RIGHT WHITE LENTICULAR BRAKCET

U+FEFF ZERO WIDTH NO-BREAK SPACE

U+122D4 CUNEIFORM SIGN SHIR TENU
U+122D5 CUNEIFORM SIGN SHIR OVER SHIR BUR OVER BUR

U+1B001 HIRAGANA LETTER ARCHAIC YE

U+1D0C5 BYZANTINE MUSICAL SYMBOL FHTORA SKLIRON CHROMA VASIS

U+1D300 MONOGRAM FOR EARTH
U+1D301 DIGRAM FOR HEAVENLY EARTH
U+1D302 DIGRAM FOR HUMAN EARTH
U+1D303 DIGRAM FOR EARTHLY HEAVEN
U+1D304 DIGRAM FOR EARTHLY HUMAN
U+1D305 DIGRAM FOR EARTH

U+1D300 MONOGRAM FOR HUMAN
U+1D301 DIGRAM FOR HEAVENLY HUMAN
U+1D302 DIGRAM FOR EARTHLY HUMAN
U+1D303 DIGRAM FOR HUMANLY HEAVEN
U+1D304 DIGRAM FOR HUMANLY EARTH
U+1D305 DIGRAM FOR HUMANLY HUMAN

 

Appendix A: Notes on Zarqa and Zinor

There are two separate cantillation systems in the Hebrew Bible. One is used for Psalms, Proverbs and (most of) Job (the "poetic" books, hence the "poetic system"), and the other is used everywhere else. The two systems have structural similarities and share some graphemes, but not all. In modern printing the accents have roughly the same shape; old manuscripts actually had them written slightly differently. In the prose system there is an accent called ZARQA, which is postposed (on or to the left of the last letter), and in the poetic system there is one called TSINOR (and also zarqa and vice-versa; each of these has many names) which has the same shape and placement and even an analogous function in the structure of the cantillations. There is another accent, only in the poetic system, called the TSINNORIT (a diminutive of tsinor), which occurs directly above its letter, and is (almost?) never on the last letter of its word. (More modern printing tends to put the zarqa right on top of its letter too, but that's just a printing preference.) If you look closely at some old manuscripts, you can tell that tsinnorit has a slightly different shape than zarqa/tsinor.

As encoded in Unicode, there are ZARQA (U+0598) and ZINOR (U+05AE) [sic]. By the usual meanings of those names, those should properly be synonyms, the same accent, but they're not. While the word "zinor" would be mnemonic of "tsinnorit," it's the wrong way around in the character names: ZINOR has the combining class of above-postposed, and ZARQA is encoded to go directly above the letter. So, to encode a zarqa or a tsinor, you need to use ZINOR, and to encode a tsinnorit, you need to use ZARQA.

[contributed by: Mark Shoulson]
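As an illustration added in editing (not part of the contributed note), the difference in combining class between the two characters can be inspected with Python's standard unicodedata module. This is a minimal sketch; the numeric values in the comments are those listed in UnicodeData.txt as far as the editor is aware, with 230 corresponding to Above and 228 to Above_Left, and the results depend on the Unicode version bundled with the interpreter.

```python
import unicodedata

zarqa = "\u0598"  # HEBREW ACCENT ZARQA: placed directly above the letter
zinor = "\u05AE"  # HEBREW ACCENT ZINOR: the postposed accent

for ch in (zarqa, zinor):
    # combining() returns the canonical combining class as an integer.
    print(f"U+{ord(ch):04X} {unicodedata.name(ch)} "
          f"ccc={unicodedata.combining(ch)}")

# Per UnicodeData.txt: ZARQA is 230 (Above), ZINOR is 228 (Above_Left).
zarqa_ccc = unicodedata.combining(zarqa)
zinor_ccc = unicodedata.combining(zinor)
```

A program that needs the postposed zarqa or tsinor would therefore select U+05AE, and one that needs the tsinnorit would select U+0598, regardless of what the character names suggest.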

Acknowledgements

Thanks to John Hudson, James Kass, KAWABATA Taichi, Marc Lodewijck, Artur Q.A., Mark Shoulson, and Andrew West for their contributions.

Modifications

The following summarizes modifications from the previous version of this document.

Revision 4

Revision 3

Revision 2

Revision 1