|Authors||Asmus Freytag, Rick McGowan and Ken Whistler|
|Date||May 08, 2006|
This document provides information on many known anomalies in the formal character names in the Unicode Standard.
This document is a Unicode Technical Note. Sole responsibility for its contents rests with the author(s). Publication does not imply any endorsement by the Unicode Consortium. This document is not subject to the Unicode Patent Policy.
For information on Unicode Technical Notes including criteria for acceptance, see http://www.unicode.org/notes/.
In this document we list all Unicode character names with known clerical errors in the spelling of their names at the time of its writing. In addition, we have compiled information on many misnamed characters, misleading character names, and characters with other known problems with their names.
Because the Unicode Standard is a character encoding standard and not the Universal Encyclopedia of Writing Systems and Character Identity, the stability and uniqueness of published character names is far more important than the correctness of the name. The published character names are normative for the purposes of the Unicode Standard and the large number of other IT standards that reference it. These standards require stable identifiers, and character names must therefore be immutable: any change of character names is almost as disruptive to the standards as changing code points for characters would be. Accordingly, the Unicode Consortium has adopted the Unicode Standard Stability Policy, preventing changes in character names. As a result, errors in character names cannot be corrected. Instead, important character name anomalies are documented with annotations in the Unicode Character Code Charts.
The requirement for a unique and stable character name that can be used as a formal identifier does not mean that the Unicode Standard dictates to anyone what the name of any given letter in their writing system should properly be, whether in English or in any other language. The Unicode Code Charts provide informative aliases for a large number of characters, the names of which are not anomalous or defective. This is because different user communities often use different names for the same character, even in English.
One reason the Unicode Standard publishes many informative aliases in the Unicode names list is that there are often much better, more communicative names for particular characters, even in English, than the normative names in the data file. For example, U+002F SOLIDUS is more widely known among its American users as slash. Informal aliases are useful in describing a character, but cannot be used as identifiers, because they are not guaranteed to be unique or stable. Users are free to use such aliases and other names, as long as they are not misrepresented as corrections to the standard, but instead used as alternative, more useful names for characters in the standard.
For character names that were encoded with misspelled words as part of their name, or that exhibit other serious errors, the Unicode Standard has adopted normative character name aliases. These aliases can be used as an alternative, normative identifier for the character without the need to preserve the original spelling or other error in the character name. While this means that some characters can have more than one identifier, each identifier continues to uniquely refer to a single character. Formal aliases are documented in the NameAliases.txt file in the Unicode Character Database. Formal name aliases are also documented in the Unicode Code Charts. We have not documented them here; instead, we merely indicate for which characters formal aliases exist at the time of this writing.
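As a rough illustration of how a formal alias behaves as a second identifier, the following Python sketch uses the standard `unicodedata` module, whose `lookup()` function accepts name aliases in Python 3.3 and later. The helper names `formal_name` and `resolve` are hypothetical, and the results depend on the version of the Unicode Character Database bundled with the interpreter.

```python
import unicodedata

def formal_name(ch: str) -> str:
    """Return the formal character name, misspellings included."""
    return unicodedata.name(ch)

def resolve(identifier: str) -> str:
    """Resolve a formal name or a formal alias to its character."""
    return unicodedata.lookup(identifier)

# The original (misspelled) name and the formal alias for U+FE18.
misspelled = "PRESENTATION FORM FOR VERTICAL RIGHT WHITE LENTICULAR BRAKCET"
corrected = "PRESENTATION FORM FOR VERTICAL RIGHT WHITE LENTICULAR BRACKET"

# Both identifiers refer to the same single character.
same_character = resolve(misspelled) == resolve(corrected) == "\uFE18"
print(same_character)

# The formal name itself is unchanged by the existence of the alias.
print(formal_name("\uFE18"))
```

The lookup succeeds for both spellings, while the name reported for the character remains the original one, which matches the description above: the alias is an additional identifier, not a replacement.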
In some cases, annotations have been added to the names list in the Unicode Standard to document various lesser problems, but to date there has been no full listing of all known problems.
The authors therefore intend this Technical Note to serve as a convenient summary of the information about character name anomalies in the Unicode Standard at the time of its writing. It will be updated from time to time as additional anomalies become known. While the information in this technical note is based on information published in the Unicode Standard, the selection and manner of presentation in this document reflect choices made by its authors; it does not in any way supersede the information in the Unicode Standard.
This section lists character names with known anomalies, including those for which a formal alias has been defined. It provides further information about some names that have been the objects of discussion or inquiry. As issues are reported, additional entries may be added at any time and without notice. While many of the explanations below are based on annotations in the Unicode Code Charts, they have been edited or restated by the authors.
U+0149 LATIN SMALL LETTER N PRECEDED BY APOSTROPHE
- Even though this is encoded as a single character, it is not usually considered a single letter.
U+01A2 LATIN CAPITAL LETTER OI
U+01A3 LATIN SMALL LETTER OI
- These should have been called letter GHA. They are neither pronounced 'oi' nor based on the letters 'o' and 'i'.
U+01BE LATIN LETTER INVERTED GLOTTAL STOP WITH STROKE
- This is actually based on a ligation of "ts", not an inverted glottal stop.
U+0238 LATIN SMALL LETTER DB DIGRAPH
U+0239 LATIN SMALL LETTER QP DIGRAPH
- These are actually ligatures, rather than digraphs.
U+025B LATIN SMALL LETTER OPEN E
- This is actually a Latin epsilon and should have been so called.
U+025E LATIN SMALL LETTER CLOSED REVERSED OPEN E
- Actually a closed reversed epsilon (reversed form of U+025B).
U+0285 LATIN SMALL LETTER SQUAT REVERSED ESH
- This is actually a reversed fishhook r with retroflex hook.
U+030C COMBINING CARON
- The "caron" should have been called hacek and combining hacek. The term "caron" is suspected by some to be an invention of some early standards body, but it has also been claimed by others to have been in use at Linotype before the days of digital typography. Its true origin may be lost in the mists of time.
U+034F COMBINING GRAPHEME JOINER
- The name does not describe the function of this character. Despite its name, it does not join graphemes. For more information, see Section 7.9, Combining Marks, of the Unicode Standard.
U+039B GREEK CAPITAL LETTER LAMDA
U+03BB GREEK SMALL LETTER LAMDA
- The use of the spelling lamda derives from ISO 10646. This does not mean that it is more correct than lambda, merely that the spelling without the 'b' is the one used in the formal character names.
U+04A5 CYRILLIC SMALL LIGATURE EN GHE
U+04B5 CYRILLIC SMALL LIGATURE TE TSE
U+04D5 CYRILLIC SMALL LIGATURE A IE
- Despite their names, these are not decomposable ligatures.
U+0598 HEBREW ACCENT ZARQA
- Perhaps should have been called Hebrew accent tsinnorit. May also be used for zarqa when shown on accented non-final letter. See Appendix A.
U+05AE HEBREW ACCENT ZINOR
- Should have been called Hebrew accent zarqa (= tsinor). See Appendix A.
U+0670 ARABIC LETTER SUPERSCRIPT ALEF
- Not an Arabic letter, but a vowel sign.
U+06C0 ARABIC LETTER HEH WITH YEH ABOVE
U+06C2 ARABIC LETTER HEH GOAL WITH HAMZA ABOVE
U+06D3 ARABIC LETTER YEH BARREE WITH HAMZA ABOVE
- These would have been better named as ligatures.
U+0964 DEVANAGARI DANDA
U+0965 DEVANAGARI DOUBLE DANDA
- Despite the fact that these characters have "DEVANAGARI" in their names, these punctuation marks are intended for common use for the scripts of India.
U+0A01 GURMUKHI ADAK BINDI
- The spelling of the word Adak with a single 'd' is inconsistent with U+0A71 GURMUKHI ADDAK and should really have had two d's.
U+0B83 TAMIL SIGN VISARGA
- This character is the aaytham.
U+0CDE KANNADA LETTER FA
- There is no Kannada letter 'fa'; this character represents the syllable 'llla'.
U+0E9D LAO LETTER FO TAM
- The name for this character should have been fo sung, but that name is already used for U+0E9F. A formal alias LAO LETTER FO FON correcting this error has been defined.
U+0E9F LAO LETTER FO SUNG
- The name for this character should have been fo tam, but that name is already used for U+0E9D. A formal alias LAO LETTER FO FAY correcting this error has been defined.
U+0EA3 LAO LETTER LO LING
- The name for this character should have been lo loot, but that name is already used for U+0EA5. A formal alias LAO LETTER RO correcting this error has been defined.
U+0EA5 LAO LETTER LO LOOT
- The name for this character should have been lo ling, but that name is already used for U+0EA3. A formal alias LAO LETTER LO correcting this error has been defined.
U+0F0A TIBETAN MARK BKA- SHOG YIG MGO
- This character is used to indicate that a document is addressed to a superior (the "petition honorific"), but the Tibetan name actually indicates a superior addressing an inferior ("starting flourish for giving a command").
U+0F0B TIBETAN MARK INTERSYLLABIC TSHEG
- The tsheg mark is not restricted to intersyllabic usage, and would have been better named Tibetan mark tsheg.
U+0F0C TIBETAN MARK DELIMITER TSHEG BSTAR
- This character is not a delimiter, but is a non-breaking version of the tsheg mark (U+0F0B) that is used exclusively between the letter NGA (U+0F44) and the shad mark (U+0F0D).
U+0FD0 TIBETAN MARK BSKA- SHOG GI MGO RGYAN
- The syllable "BSKA-" does not occur naturally in Tibetan, and is a mistake for "BKA-" (cf. U+0F0A). A formal alias correcting this error has been defined.
U+156F CANADIAN SYLLABICS TTH
- There is no 'tth' syllable. A better name would have been Canadian Syllabics asterisk.
U+178E KHMER LETTER NNO
- As this character belongs to the first register, its correct transliteration is nna, not NNO.
U+179E KHMER LETTER SSO
- As this character belongs to the first register, its correct transliteration is ssa, not SSO.
U+200B ZERO WIDTH SPACE
- This isn't a "space". It is an invisible character that can be used to provide line break opportunities.
U+2113 SCRIPT SMALL L
- Despite its character name, this symbol is derived from a special italicized version of the small letter "L".
U+2118 SCRIPT CAPITAL P
- Should have been called calligraphic small p or perhaps even Weierstrass elliptic function symbol, which is what it is used for. It's not a capital "P" at all.
U+262B FARSI SYMBOL
- This symbol is so named because, as a symbol of Iran, it cannot be encoded in ISO standards.
U+3021 HANGZHOU NUMERAL ONE
U+3022 HANGZHOU NUMERAL TWO
U+3023 HANGZHOU NUMERAL THREE
U+3024 HANGZHOU NUMERAL FOUR
U+3025 HANGZHOU NUMERAL FIVE
U+3026 HANGZHOU NUMERAL SIX
U+3027 HANGZHOU NUMERAL SEVEN
U+3028 HANGZHOU NUMERAL EIGHT
U+3029 HANGZHOU NUMERAL NINE
- The Suzhou numerals (Chinese su1zhou1ma3zi) are special numeric forms used by traders to display the prices of goods. The use of "HANGZHOU" in the names is a misnomer.
U+327C CIRCLED KOREAN CHARACTER CHAMKO
U+327D CIRCLED KOREAN CHARACTER JUEUI
- An instance of inconsistent transliterations, resulting from unreconciled North/South Korean positions.
U+A015 YI SYLLABLE WU
- This is not a syllable pronounced "wu", but is actually a syllable iteration mark.
U+FA0E CJK COMPATIBILITY IDEOGRAPH-FA0E
U+FA0F CJK COMPATIBILITY IDEOGRAPH-FA0F
U+FA11 CJK COMPATIBILITY IDEOGRAPH-FA11
U+FA13 CJK COMPATIBILITY IDEOGRAPH-FA13
U+FA14 CJK COMPATIBILITY IDEOGRAPH-FA14
U+FA1F CJK COMPATIBILITY IDEOGRAPH-FA1F
U+FA21 CJK COMPATIBILITY IDEOGRAPH-FA21
U+FA23 CJK COMPATIBILITY IDEOGRAPH-FA23
U+FA24 CJK COMPATIBILITY IDEOGRAPH-FA24
U+FA27 CJK COMPATIBILITY IDEOGRAPH-FA27
U+FA28 CJK COMPATIBILITY IDEOGRAPH-FA28
U+FA29 CJK COMPATIBILITY IDEOGRAPH-FA29
- These 12 characters are unified CJK ideographs, not compatibility ideographs, despite their names.
U+FE18 PRESENTATION FORM FOR VERTICAL RIGHT WHITE LENTICULAR BRAKCET
- A spelling error: "brakcet" should be "bracket". A formal alias correcting this error has been defined.
U+FEFF ZERO WIDTH NO-BREAK SPACE
- This character is the Byte Order Mark. (Naming it ZWNBSP was a mistake from the start.)
U+1D0C5 BYZANTINE MUSICAL SYMBOL FHTORA SKLIRON CHROMA VASIS
- A spelling error: "fhtora" should be "fthora". A formal alias correcting this error has been defined.
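Some of the entries above can be checked against character properties rather than names. The following is a minimal sketch using Python's standard `unicodedata` module; the helper `describe` is hypothetical, and the values reported depend on the Unicode Character Database version shipped with the interpreter.

```python
import unicodedata

def describe(ch: str) -> dict:
    """Collect the formal name and two properties for one character."""
    return {
        "name": unicodedata.name(ch),
        "category": unicodedata.category(ch),            # General_Category
        "decomposition": unicodedata.decomposition(ch),  # "" if none
    }

space = describe("\u0020")             # an ordinary space, for comparison
zwsp = describe("\u200B")              # ZERO WIDTH SPACE
superscript_alef = describe("\u0670")  # ARABIC LETTER SUPERSCRIPT ALEF
cyrillic_a_ie = describe("\u04D5")     # CYRILLIC SMALL LIGATURE A IE
fa0e = describe("\uFA0E")              # one of the 12 unified ideographs
fa10 = describe("\uFA10")              # a true compatibility ideograph

for info in (space, zwsp, superscript_alef, cyrillic_a_ie, fa0e, fa10):
    print(info["name"], info["category"], repr(info["decomposition"]))
```

In this sketch, ZERO WIDTH SPACE reports a format category rather than a space separator, ARABIC LETTER SUPERSCRIPT ALEF reports a nonspacing mark rather than a letter, and neither the Cyrillic "ligature" nor U+FA0E carries a decomposition mapping, whereas the neighboring compatibility ideograph U+FA10 does.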
There are two separate cantillation systems in the Hebrew Bible. One is used for Psalms, Proverbs and (most of) Job, (the "poetic" books, hence the "poetic system"), and the other is used everywhere else. The two systems have structural similarities and share some graphemes, but not all. In modern printing the accents have roughly the same shape; old manuscripts actually had them written slightly differently. In the prose system there is an accent called ZARQA, which is postposed (on or to the left of the last letter), and in the poetic system there is one called TSINOR (and also zarqa and vice-versa; each of these has many names) which has the same shape and placement and even an analogous function in the structure of the cantillations. There is another accent, only in the poetic system, called the TSINNORIT (a diminutive of tsinor), which occurs directly above its letter, and is (almost?) never on the last letter of its word. (More modern printing tends to put the zarqa right on top of its letter too, but that's just a printing preference). If you look closely at some old manuscripts, you can tell that tsinnorit has a slightly different shape than zarqa/tsinor.
As encoded in Unicode, there are ZARQA (U+0598) and ZINOR (U+05AE) [sic]. By the usual meanings of those names, those should properly be synonyms, the same accent, but they're not. While the word "zinor" would be mnemonic of "tsinnorit," it's the wrong way around in the character names: ZINOR has the combining class of above-postposed, and ZARQA is encoded to go directly above the letter. So, to encode a zarqa or a tsinor, you need to use ZINOR, and to encode a tsinnorit, you need to use ZARQA.
[contributed by: Mark Shoulson]
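The difference in combining class described above can be inspected directly. This is a small sketch using Python's standard `unicodedata` module; the constants `ZARQA` and `ZINOR` are names chosen here for illustration, and the values come from the Unicode Character Database bundled with the interpreter, which lists canonical combining class 230 (Above) for U+0598 and 228 (Above_Left) for U+05AE.

```python
import unicodedata

ZARQA = "\u0598"  # HEBREW ACCENT ZARQA: placed directly above the letter
ZINOR = "\u05AE"  # HEBREW ACCENT ZINOR: the postposed accent

for ch in (ZARQA, ZINOR):
    # combining() returns the canonical combining class as an integer.
    print(f"U+{ord(ch):04X} {unicodedata.name(ch)}: "
          f"ccc={unicodedata.combining(ch)}")
```

Because the two characters have different combining classes, they are positioned differently, so picking the character by the everyday meaning of its name would select the wrong accent.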
Thanks to John Hudson, James Kass, Mark Shoulson, and Andrew West for their contributions.
The following summarizes modifications from the previous version of this document.
Unicode and the Unicode logo are trademarks of Unicode, Inc., and are registered in some jurisdictions.