From portal!cup.portal.com!James_T_Caldwell@Sun.COM Tue Oct 9 08:25:05 1990
From: portal!cup.portal.com!James_T_Caldwell@Sun.COM
To: asumusf%microsoft@Sun.COM, u-core%noddy@Sun.COM
Subject: bloks
Date: Tue, 9 Oct 90 00:19:51 PDT

Unicore,

Here again is the blocks list for comment and revision. It has bounced twice, costing a week, from u-core@sun.com and u-core@noddy.sun.com. Hope this works.

CHARACTER BLOCKS AND BLOCK INTRODUCTIONS

The first 256 Unicodes may be considered as a group, although they fall into four distinct sub-blocks:

    U+0000-001F: C0 ASCII control codes
    U+0020-007F: ASCII graphic characters
    U+0080-009F: C1 control codes
    U+00A0-00FF: ISO 8859/1 (aka Latin1)

Standards: Unicode adapts the ISO standards for 7-bit and 8-bit characters by retaining the semantics and numeric values of these codes, merely supplying enough leading zeroes to convert them into 16-bit numbers. In terms of a 16-bit space, the content and arrangement of these standards is far from optimal, but Unicode retains them without change because of their prevalence in existing usage.

ISO 646: The ISO character encoding standards are founded on 646, "7-bit coded character set for information processing interchange". This provides an "international" set of interpretations for numeric values U+0000-007F, which are intended to be "localized" into national standard codes.

ANSI X3.4-1977: This is ASCII: "American National Standard Code for Information Interchange". ASCII is the version of ISO 646 "localized" into the national standard code for the USA. In the few places where ISO 646 and ASCII may differ, Unicode gives priority to the "specific" interpretations of ASCII rather than to the "generic" interpretations of ISO 646. (For example, at code U+0024, ISO 646 has the generic "international currency symbol", whereas in ASCII and Unicode this is localized to the dollar sign.) The principle is to have explicit unambiguous character codes.
This code is assigned to the American dollar sign because that is its assignment in ASCII. Other currency symbols are likewise given their own code points within the appropriate blocks.

ISO 8859/1: Also known as "Latin1", this extension is intended to supply a "most broadly useful" 8-bit complement to ISO 646, by providing additional letters extending the Latin alphabet to cover certain major languages of Europe (listed below).

ASCII (C0) Control Codes U+0000-001F

C0 ASCII control codes: The role of "control codes" in Unicode is discussed elsewhere. Unicode makes no particular use of these control codes, but merely provides for the passage of the numeric code values intact, neither adding to nor subtracting from their semantics.

ASCII Characters U+0020-007F

ASCII graphic characters: (Technically, codes U+0020 SPACE and U+007F DELETE are control codes; the remaining 94 codes in this range are graphic characters.) Some of the non-letter characters in this range suffer from overburdened usage as a result of the limited number of codes in a 7-bit space. Some coding consequences of this are discussed below under "Semantic vs. glyphic encoding" and "Loose vs. precise semantics". The rather haphazard ASCII collection of punctuation and mathematical signs is isolated from the larger body of Unicode punctuation, signs, and symbols (which are encoded in ranges starting at U+2000) only because the relative locations within ASCII are so widely used in standards and software.

Latin1 (C1) Control Codes U+0080-009F

C1 control codes: "Control codes" in the C1 range are assigned interpretations in various ISO standards, but do not have the force of long established usage as do those in the C0 range. Whatever the eventual assignments in the C1 range may be, Unicode makes no particular use of them; it merely provides for the passage of the numeric code values intact.
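The zero-extension rule described under "Standards" above can be illustrated with a minimal Python sketch. It assumes Python's built-in "latin-1" codec, which is not part of this draft, and the helper names widen_latin1 and sub_block are hypothetical, introduced only for illustration.

```python
def widen_latin1(data: bytes) -> list:
    """Zero-extend each 8-bit ISO 8859/1 code; the numeric value is unchanged."""
    return [b for b in data]


def sub_block(code: int) -> str:
    """Name the sub-block of the first 256 codes that a value falls into."""
    if not 0 <= code <= 0xFF:
        raise ValueError("outside the first 256 codes")
    if code <= 0x1F:
        return "C0 ASCII control codes"
    if code <= 0x7F:
        return "ASCII graphic characters"
    if code <= 0x9F:
        return "C1 control codes"
    return "ISO 8859/1 (Latin1)"


sample = bytes([0x41, 0x24, 0xE9])  # 'A', dollar sign, e with acute accent
for code in widen_latin1(sample):
    print(f"U+{code:04X}", sub_block(code))
```

Under these assumptions the widened values coincide with what Python's "latin-1" decoder produces, which is the sense in which the 8-bit codes are retained "without change".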
Latin1 Characters U+00A0-00FF

ISO 8859/1 (aka Latin1): Unicode specifies that combinations of a base letter plus a diacritical mark be coded out as two separate character codes. However, because ISO 8859 serves those who desire to assign single codes for the most commonly used baseform-mark combinations, Unicode offers separate codepoints for these composed characters, treating them as if they were single characters. Unfortunately, this engenders multiple spellings of single constructs, introducing ambiguity for users. Therefore, even though these composite characters are included and can be used, pure Unicode implementations will code the diacritics separately.

The languages that were formally targeted for coverage by extended Latin ISO 8859/1 (which supplements the Latin characters in ISO 646) are: Danish, Dutch, English, Faeroese, Finnish, French, German, Icelandic, Irish, Italian, Norwegian, Portuguese, Spanish, and Swedish. Many other languages can be written with this set of letters, including Hawaiian, Indonesian/Malay, and Swahili. The characters within this group that have relatively limited use are annotated with the major language(s) employing them. The characters in ISO 8859/2, 8859/3, and 8859/4 (additional Extended Latin characters) are encoded in the following Unicode block.

Like ASCII, the Latin1 set includes a rather miscellaneous set of punctuation and mathematical signs. Punctuation, signs, and symbols not included in ASCII and ISO 8859 are encoded in Unicode addresses starting at U+2000.

"Diacritical mark" characters: ASCII contains four codes which it treats as potential diacritical marks: U+005E, U+005F, U+0060, U+007E; Latin1 contains five such codes: U+00A8, U+00AF, U+00B0, U+00B4, U+00B8. In Unicode, these codepoints are unambiguously restricted to use as spacing characters; the corresponding non-spacing characters are coded elsewhere and cross-referenced.

Semantic vs.
glyphic encoding: Because the numeric code values in this range are well-established and widely used in various implementations, Unicode imposes minimal specifications on the typographic appearance of corresponding glyphs. For example, the value ASCII 0024 has the semantic "dollar sign" in the US, leaving open the question of whether the dollar sign is to be rendered with one vertical stroke or two. Thus, this Unicode value is taken to refer to the identity of the "dollar sign" semantic, not to its precise appearance. Likewise, for the codes in this range that are indicated with alternative glyphs, the code is associated with the basic usage, and different systems are free to present the particular graphical form of their choice.

Loose vs. precise semantics: Some ASCII characters have multiple uses, either through ambiguity in the original standards or through accumulated reinterpretations of a limited codeset. For example, U+0027 is defined in ANSI X3.4-1977 as "Apostrophe (Closing Single Quotation Mark; Acute Accent)", and U+002D as "Hyphen (Minus)". In general, Unicode intends merely to provide for the passage of these numeric code values intact, without adding to or subtracting from their semantics. Unicode supplies unambiguous codes elsewhere for the most useful particular interpretations of these ASCII values, and the corresponding unambiguous characters are cross-referenced. In a very few cases, Unicode indicates a preferred interpretation of an ASCII code, e.g. U+0027 is intended to be neutral (vertical).

Author accepting responsibility is :___Joe Becker__________________

European Latin U+0100-017E

Extended Latin U+0180-024F

The "Extended Latin" block is provided as a grab-bag of letterforms used to extend the Latin script for non-European languages, phonetic symbols (other than the standard International Phonetic Alphabet symbols in the following block), or other special uses.
Standards: This block covers, among other things, a registered standard for graphic characters used by African languages, ISO 6438 = German Standard DIN 31625, plus "Pinyin" Latin transcription characters in the People's Republic of China national standard GB 2312-80.

Encoding structure: The Unicode block for Extended Latin is divided into the following ranges:

    U+0180-01C3: Extended Latin
    U+01C4-01CC: Croatian digraphs matching Serbian Cyrillic letters
    U+01CD-01D4: Pinyin diacritic-vowel combinations
    U+01D5-024F: Additions and currently unassigned

Extended Latin: This group is merely a union of forms collected from a variety of different sources (the single greatest source is ISO 6438). The forms are arranged in approximate Latin alphabetical order. Upper/lower case pairs are placed together where possible, but in many cases the other case form is encoded at some rather distant location, and so is cross-referenced. The arrangement is not particularly defensible, but for different variations on the same base letter, the order is as follows: turned, inverted, hook attachment, stroke extension or modification, different style (e.g. script), small cap, modified basic form, ligature, Greek-derived. A small collection of marginally-Latin forms concludes this group.

Croatian digraphs matching Serbian Cyrillic letters: Unicode generally avoids encoding digraphs and other multiple letterforms, but an exception is made for the case of Serbocroatian, which is a single language with paired alphabets in Latin script (Croatian) and Cyrillic script (Serbian). In this unique case, direct one-to-one character transliteration is a reasonable ideal, and this set of digraph codes is provided for this purpose. The appropriate cross-references are given for the lowercase letters. One problem with digraph codes is that there are two potential uppercase forms, depending on whether only the initial letter is to be capitalized, or both (for the case of all-caps).
Unicode does not in itself aim to provide any solution to this problem for software that transliterates between Croatian and Serbian.

Pinyin diacritic-vowel combinations: The PRC standard GB 2312-80 provides a set of codes for the "Pinyin" Latin transcription of Mandarin Chinese. Most of the letters used in Pinyin romanization, even those with diacritical marks, are already covered in the preceding Latin blocks. The rather exceptional group of eight codes here is provided in order to cover the remaining Pinyin combinations specified in GB 2312-80.

Additions and currently unassigned: The unassigned space for this block is made unusually large, on the supposition that the Latin script is the most widely used in the world, and hence will be subject to the most extensions for various purposes in the future.

Case pairs: A number of characters in this block are uppercase forms of characters whose lowercase form is included in some other grouping. Most often, this occurs with characters that originated as members of the International Phonetic Alphabet which, when adopted into the Latin-based script of a real language, acquire a novel uppercase form. Occasionally alternative uppercase forms arise by this process. If usage information indicates that two uppercase forms are merely minor glyphic variants of the same form, they are given a single code, as for U+01B7 LATIN CAPITAL LETTER YOGH. If usage information indicates that two uppercase forms are actually used differentially, then they are given different codes, as for U+018E LATIN CAPITAL LETTER TURNED-E vs. U+018F LATIN CAPITAL LETTER SCHWA. In the latter event, the lowercase form is cloned (U+01D5 LATIN SMALL LETTER TURNED E, clone of U+0259 LATIN SMALL LETTER SCHWA), so as to enable unique case-pair mappings if desired.

Languages: Some indication of language or other usage is given for most characters, but this information is by no means to be regarded as exclusive.
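The two points above about case forms (a digraph has two candidate uppercase forms, and cloned lowercase letters allow unique case pairs) can be observed in a short Python sketch. This is only an illustration under a loud assumption: it uses the character tables of present-day Unicode as shipped with Python, whose code points and names differ in places from this draft (for example, the turned e is at U+01DD there, not U+01D5). The helper name case_forms is hypothetical.

```python
import unicodedata


def case_forms(ch: str) -> dict:
    """Return the lowercase, all-caps, and initial-cap forms of one character."""
    return {"lower": ch.lower(), "upper": ch.upper(), "title": ch.title()}


# A Croatian digraph code (present-day U+01C6) has two distinct capital forms.
for kind, form in case_forms("\u01C6").items():
    print(kind, f"U+{ord(form):04X}", unicodedata.name(form))

# Two capitals that are used differentially each get their own lowercase
# partner, so the case mapping is unique in both directions.
for capital in ("\u018E", "\u018F"):
    small = capital.lower()
    print(f"U+{ord(capital):04X} <-> U+{ord(small):04X}", small.upper() == capital)
```

The sketch does not address transliteration between Croatian and Serbian; it only shows that a per-character case table can carry both capital forms of a digraph.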
Author accepting responsibility is :____Joe Becker_________________

Standard Phonetic U+0250-02AF

The "Standard Phonetic" block contains primarily the unique symbols of the International Phonetic Alphabet (IPA), which is a standard system for indicating specific sounds. The IPA was first propounded in 1886, and has undergone occasional revision of content and of usage since that time. Unicode covers all single symbols and all non-ligature alternates in the last published IPA revision (1979). The use of diacritical marks for close phonetic transcription is an integral part of IPA, as is the use of small "modifier letters" (the IPA diacritics and modifiers are encoded in the two blocks following this one). A few symbols have been added to this block that are peculiar to IPA-derived transcriptional practices among Sinologists, Americanists, et al. Note also that a few non-standard or obsolete phonetic symbols are encoded in the block preceding this one (Extended Latin).

Unifications: IPA includes the entire lowercase Latin alphabet a-z, a number of extended Latin letters (e.g. U+0153 LATIN SMALL LETTER COMBINATION O E), and a few Greek letters. The question of whether these characters are "the same" when used in an IPA context, or whether all IPA forms should be considered as a separate unique alphabet, has many reasonable arguments on both sides. Ultimately, Unicode was designed so as to unify the IPA symbols as much as possible with other letters (although not with non-letter symbols such as U+222B INTEGRAL SIGN or U+2299 DIRECT PRODUCT). A primary reason, aside from reduced duplication, is that the IPA symbols have become adopted into Latin scripts for many languages (e.g. in Africa). There seems to be no merit in the futile attempt to distinguish a "transcription" from an "actual script" in such cases. The result is that several IPA symbols are found in ranges other than this block.
Apart from the Latin alphabet, these are cross-referenced at the beginning of the names list.

IPA alternates: In a few cases where standard IPA practice has evolved alternate forms, e.g., U+0269 SCRIPT-I "i" versus U+026A SMALL CAP-I "I", Unicode provides separate encodings for the two alternates.

Case pairs: IPA does not sanction case distinctions, so in effect its phonetic symbols are all lowercase. When IPA symbols are adopted into the "actual script" of a language, as for example has occurred in Africa, they acquire uppercase forms. Since these uppercase forms are not themselves IPA symbols, they are encoded in the block preceding this one (Extended Latin), and cross-referenced with the IPA names list.

Typographic variants: IPA includes typographic variants of certain Latin letters, which would ordinarily be considered variations of font style rather than of character identity, e.g. "script" or "small cap" letterforms. These forms are encoded as separate characters in Unicode so that all of IPA may be encompassed within a single font. Unicode also separately encodes the unique IPA typographic variant of the Greek letter "phi", as well as the borrowed Greek letter "iota", which has a unique Latin uppercase form.

Diacritical marks: Unicode presumes the necessity of dynamically-applied (so-called "floating") diacritical marks, which happen to be an essential element of IPA orthography. In Unicode, all diacritical marks are encoded in sequence after the base character to which they apply. For more details, see the section on diacritical marks.

Standards: 2nd DP ISO 10646: Unicode covers the phonetic characters contained in 2nd DP ISO 10646 (which are taken from the Xerox Character Code Standard). The Xerox/10646 set considers IPA forms to be a separate alphabet, so the Latin alphabet a-z and other symbols are duplicated there. Although 10646 rejects the use of applied diacritical marks for Latin letters, it provides such marks for the equivalent letters in IPA.
Unicode diacritics are for general application; Unicode does not duplicate the Latin alphabet a-z in IPA.

Encoding structure: The Standard Phonetic block is arranged in approximate alphabetical order according to the Latin letter that is graphically most similar to each symbol. This has nothing to do with a phonetic arrangement.

Author accepting responsibility is :___Ken Whistler__________________

Modifier Letters U+02B0-02FF

Modifier Letters are an assorted collection of small signs that are used generally to indicate modifications of the preceding letter, although a few may modify the following letter, and some may serve as independent letters. These signs are distinguished from "diacritical marks" in that modifier letters are treated as free-standing spacing characters. They are distinguished from similar- or identical-appearing punctuation or symbols by the fact that the members of this block are considered to be letter characters that do not break up a word. The majority of these signs are phonetic modifiers, including the requirements of the International Phonetic Alphabet (IPA).

Phonetic usage: In phonetic usage these signs are sometimes called "diacritics", which is correct in the logical sense that they are modifiers of the preceding letter. However, in Unicode, the term "diacritical marks" refers specifically to non-spacing applied marks, whereas the codes in the current block specify spacing characters. For this reason, many of the "Modifier Letters" in this block correspond to separate "diacritical mark" codes which are cross-referenced in the names list.

Modifier letters have relatively well-defined phonetic interpretations. Their usage is generally to indicate a specific articulatory modification of a sound represented by another letter, or to convey a particular level of stress or tone, etc. The modifier letters in Unicode are collated from a variety of sources, the most important of which is the IPA.
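The distinction drawn above between spacing modifier letters and non-spacing diacritical marks can be probed with a small Python sketch. It assumes the character properties of present-day Unicode as exposed by Python's unicodedata module (general categories "Lm" and "Mn"), which postdate this draft; the helper name is_non_spacing is hypothetical.

```python
import unicodedata


def is_non_spacing(ch: str) -> bool:
    """True if ch is a non-spacing mark (general category Mn)."""
    return unicodedata.category(ch) == "Mn"


modifier = "\u02BC"  # a modifier letter (apostrophe-like sign)
mark = "\u0301"      # a non-spacing acute accent

for ch in (modifier, mark):
    print(f"U+{ord(ch):04X}", unicodedata.category(ch), is_non_spacing(ch))

# A modifier letter counts as a letter, so it does not break up a word,
# unlike the identical-looking ASCII apostrophe.
print(("a" + modifier + "b").isalpha(), "a'b".isalpha())
```

The word test relies on Python's str.isalpha(), which treats modifier letters as alphabetic; other software may draw word boundaries differently.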
Glyphic encoding: Despite Unicode's general policy of encoding characters, not glyphs, Unicode takes a "glyphic" approach to encoding the Modifier Letters. In this character set there exist different characters for the same "semantic", and there exist different "semantics" attributed to the same character in different contexts. For example, the signs U+02BC, U+02BE, U+02C0 have all been used in various publications as a Latin transliteration of the glottal stop (Arabic "hamza"), while at least U+02BC has other usages as well. The intention of the Unicode encoding is not to resolve the variations in usage, but merely to supply implementors with a set of useful forms to choose from. The list of usages given for each character should not be considered exhaustive.

Encoding structure: The Unicode block for Modifier Letters is divided into the following relatively arbitrary ranges:

    U+02B0-02B8: Phonetic modifiers derived from Latin letters
    U+02B9-02D7: Miscellaneous phonetic modifiers
    U+02D8-02DB: Spacing clones of diacritics

Latin superscripts: Graphically, some of the phonetic modifier signs are raised or superscripted, some are lowered or subscripted, and some are vertically centered. The raised signs that derive from Latin letters might suggest the superscripting of the entire Latin alphabet, but the intention here is to encode only those few forms that have specific usage in IPA or other major phonetic systems. Unicode does not in general provide separate codes for superscripted or subscripted characters (although exception is also made for a limited set of numeric forms to preserve one-to-one mapping with other prominent standards).

Spacing clones of diacritics: Some corporate standards distinguish spacing and non-spacing forms of diacritical accent marks, and Unicode provides matching codes for these interpretations when practical. The majority of the spacing forms are covered in Unicode block "Latin1" (derived from ISO 8859/1).
The four common European diacritics which do not have encodings in ISO 8859/1 are added as spacing characters in the current block. Since the encoding in this block is glyphic, these forms may be used with any suitable interpretation (e.g. U+02D9 SPACING DOT ABOVE as an indicator of Mandarin Chinese fifth tone).

Author accepting responsibility is :______Ken Whistler_______________

Generic Diacritical Marks U+0300-036F

The application of "Diacritical Marks" constitutes the fundamental extension mechanism for the Greek family of scripts (preeminently Latin, Cyrillic, and Greek). The diacritical marks in this block are intended for generic use with any of these scripts, or even more generically, with any script if desired. In addition to the marks in this block, other diacritics specific to some particular script are encoded along with the alphabet for that script. Another block of diacritical marks, primarily used with symbols, is defined in code range U+20D0-20FF. The allocation of a diacritic to one block or another is merely a matter of perceived appropriateness; it is not intended to define or limit the range of characters to which a particular mark may be applied.

Semantics of the "Diacritic" character property: The annotation of a Unicode character as a "Diacritic" (or its occurrence in the present block), and its depiction with relation to a dashed circle, constitute an assertion that this character is intended to be applied via some process to an associated character called the "base character" or "baseform". When rendered, the diacritical mark characters are intended to be attached to the preceding base character in some manner, and not to occupy a spacing position by themselves. These marks may therefore be called "non-spacing" or "floating" marks.

Marks as spacing characters: By convention, Unicode diacritical marks may be exhibited in (apparent) isolation by applying them to the SPACE character U+0020.
Also, Unicode separately encodes clones of most diacritical marks that are spacing characters, largely to provide compatibility with existing character sets. These related characters are cross-referenced.

Sequence order of base character and diacritical mark: In Unicode, all diacritical marks are intended to be encoded in sequence ***after*** the base characters. Please note that this convention is different from the convention in standard ISO 6937 and other old standards. The Unicode sequence U+0061 "a", U+0308 "¨", U+0075 "u" unambiguously encodes "äu", not "aü". The reason for the old convention was conformity with "dead keys" on mechanical typewriters, which is no longer a consideration for computers. The reason for the Unicode convention is consistency with the logical order of vowel "points" in Semitic and Indic scripts. In those scripts, diacritics logically follow their base characters.

Sequence order of multiple diacritical marks: In case of multiple diacritical marks applied to the same base character, if the result is unambiguous there is no reason to specify a sequence order for the mark characters. In the relatively rare cases where a standard sequence order of multiple marks is necessary, that order should be left-to-right, inside-outward.

Double diacritics: A few marks are depicted with two dashed circles; such marks apply to the two characters preceding them in the text stream.

Spelling of marked combinations: Since Unicode contains codes U+0075 "u", U+0308 "DIAERESIS," and also U+00FC "U WITH DIAERESIS", there are potentially two distinct sequences that both spell the letter "ü" U WITH DIAERESIS. The same problem exists for several dozen other Latin baseform-diacritic combinations. Unicode recognizes that it is futile to prohibit the formation or transmission of any sequence of characters. The only workable solution is to require that any system or application desiring to enforce a standard spelling convention filter its own input stream.
Since only a relatively small number of marked combination letters have independent Unicodes (for backward compatibility, see introduction to Latin1), requiring such filters should not pose major problems.

Standards: The handling of diacritical marks is currently a hotly-debated issue among different standards groups, since every potential solution has a high cost to some portion of the computing community. Diacritical marks are treated with a great deal of inconsistency among current standards and even within some standards. The Unicode solution recognizes the fundamental necessity of "floating" diacritics, and for consistency encourages the treatment of all diacritics as floating. At the same time it provides for compatibility mappings with the major standards that have adopted other solutions. It should be repeated, however, that the Unicode sequence order, in which the base character precedes the diacritical mark, differs from the convention in ISO 6937 and others; the aim is to reduce ambiguity through greater consistency.

Glyphic encoding: Because the generic diacritical marks have such a wide variety of applications, the encoding in this block is intentionally "glyphic" rather than "semantic". Thus, there are cases of several different semantics for the same Unicode, e.g. U+0308 = diaeresis = umlaut = double derivative. And there are cases of several different Unicodes for the same semantic, e.g., variants of "cedilla" include at least U+0312, U+0326, and U+0327. Some diacritical marks are applied across the body of the base character; Unicode is more liberal about assigning independent codes to combination letters involving these marks since it is less obvious that they are separable from the basic structure of the letter.
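The "filter its own input stream" idea from the discussion of marked-combination spellings can be sketched in Python. This is a hedged illustration, not a mechanism defined in this draft: it assumes the normalization forms of the later Unicode standard as implemented by Python's unicodedata.normalize, and the helper name filter_spelling is hypothetical.

```python
import unicodedata


def filter_spelling(text: str, form: str = "NFD") -> str:
    """Rewrite text to one spelling convention.

    "NFD" spells marked letters as base character plus following mark(s);
    "NFC" prefers the single precomposed code where one exists.
    """
    return unicodedata.normalize(form, text)


sequence = "u\u0308"    # base letter u followed by the diaeresis mark
precomposed = "\u00FC"  # single code for u with diaeresis

# The two spellings differ as code sequences ...
print(sequence == precomposed)
# ... but agree after filtering to a single convention.
print(filter_spelling(sequence) == filter_spelling(precomposed))

# The mark follows its base: a, diaeresis, u is a-diaeresis then u.
print([f"U+{ord(c):04X}" for c in filter_spelling("a\u0308u", "NFC")])
```

Defaulting to the decomposed form mirrors the stated preference for coding diacritics separately; an application that favors the compatibility codes could pass "NFC" instead.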
Encoding structure: The Unicode block for generic diacritical marks is divided into the following ranges:

    U+0300-0332: Ordinary diacritics
    U+0333-0337: Overstruck diacritics
    U+0338-033C: Double diacritics
    U+033D-036F: Currently unassigned

Author accepting responsibility is :_____Ken Whistler________________

Greek U+0370-03FF

The Greek script is used for writing the Greek language, and (in an extended variant) for the Coptic language. Greek is ancestral to the family of scripts including Latin and Cyrillic. In this family, the main peculiarity is the occasional use of diacritical marks.

Standards: The ECMA registry under ISO 2375 for use with ISO 2022 contains many Greek subsets. Unicode is based on the latest and most prominent of these: ISO 8859/7, which equals the Greek national standard ELOT 928, and also ECMA-118.

ISO 8859/7: Unicode encodes Greek characters in the same relative positions as in 8859/7. Generic punctuation characters (17 of them) are unified with characters in other Unicode ranges; cross-references to such codes are given in italics below.

2nd DP ISO 10646: For the basic Greek set, 2nd DP ISO 10646 follows the arrangement of 8859/7, but it replaces many of the generic punctuation characters with various diacritics and combinations. Of these, only U+0370 "GREEK IOTA BELOW" is retained in the Greek section of Unicode; the others may be spelled with other Unicodes. 2nd DP ISO 10646 also contains dozens of baseform-diacritic combinations, which in Unicode are sequences, not single characters.

ISO 5428-1980: A number of variant and archaic characters are taken into Unicode from this bibliographic standard.

Diacritical marks: In Unicode, diacritical marks are spelled as separate characters occurring after the baseform character in text sequence. In general, Unicode regards baseform-diacritic combinations as sequences represented via composition, which do not receive separate codes.
However, the baseform-diacritic combinations that are in 8859/7 are retained for compatibility. Several diacritical marks may be used with Greek that are not included in 8859/7. These are found in the Generic Diacritical Marks range:

    U+0300, 0301, 0303, 0304
    U+0306, 0308, 0313, 0314

Since the marks in this range are encoded by shape, not by meaning, they are appropriate for use in Greek where applicable. Multiple diacritical marks applied onto the same baseform character are to be spelled as the baseform character followed by the several mark characters in sequence. The order of diacritic characters is from the base form outward.

Encoding structure: The Unicode block for the Greek script is divided into the following ranges:

    U+0370-03CF: Mapping of the standard 8859/7
    U+03D0-03D6: Variant letterforms
    U+03D7-03D9: Punctuation-like characters
    U+03DA-03E1: Archaic letters
    U+03E2-03EF: Coptic-unique letters
    U+03F0-03FF: Currently unassigned

Variant letterforms: Variant forms of Greek letters (sigma and beta) are encoded as separate characters in ISO 8859/7 and ISO 5428-1980; therefore this approach is taken in the Unicode set.

Greek letters as symbols: A few of the Greek variants that are used primarily as technical symbols are placed in this range since they are clearly forms of Greek letters. In some cases, however, Greek letters borrowed into symbol usage may be said to have acquired separate identities, e.g. U+2126 "Ω" OHM SIGN vs. U+03A9 "Ω" GREEK CAPITAL LETTER OMEGA, or U+00B5 "µ" MICRO SIGN vs. U+03BC "μ" GREEK SMALL LETTER MU. Despite identical glyphs, the semantic distinctions are so great that these characters are assigned separate codes which are cross-referenced to distinguish them.

Punctuation-like characters: The question of which punctuation-like characters are "uniquely Greek" and which ones can be unified with generic Western punctuation has no definitive answer.
The Greek question mark U+03D7 ";" was retained for use by systems which treat it as a sentence-final punctuation in distinction from the semicolon.

Archaic letters: Archaic letters have been retained from ISO 5428-1980, since there are only a few of them. Their lower-case forms also occur in 2nd DP ISO 10646.

Coptic-unique letters: The Coptic script is regarded as a font/style variant of the Greek alphabet. The letters unique to Coptic have been added, since there are only a few of them. Their lower-case forms (except one) also occur in 2nd DP ISO 10646. A complete Coptic set would be obtained by rendering the whole Greek alphabet in that same style.

Author accepting responsibility is :____Joe Becker_________________

Cyrillic U+0400-048F

The Cyrillic script is a member of the Greek family of scripts. Cyrillic has traditionally been used for writing various Slavic languages, among which Russian is now predominant. In recent years, Cyrillic has been extended for representing non-Slavic minority languages of the Soviet Union. The Cyrillic script is well-behaved, its main peculiarity being the occasional use of diacritical marks. Cyrillic letters come in uppercase/lowercase pairs.

Standards: The ECMA registry under ISO 2375 for use with ISO 2022 contains several Cyrillic subsets. Unicode is based on the latest and most prominent of these: ISO 8859/5. The old Soviet standard for Russian only, GOST 13052-67, appears to be giving way to ISO 8859/5.

GOST 13052-67: ("GOST" stands for "Government Standard".) The old Soviet standard fails to encode even the full Russian alphabet (omitting # and #). The Russian letters it does contain are encoded in order of their ASCII phonetic counterparts, not in the order of the Russian alphabet (presumably to enable automatic approximate transliteration). This approach is so counter-intuitive that no other standard follows it for the Russian alphabet.
ISO 8859/5: Unicode encodes Cyrillic characters in the same relative positions as in 8859/5. Generic punctuation characters (4 of them) are unified with characters in other Unicode ranges; cross-references to such codes are given in italics below.

2nd DP ISO 10646: For the basic Cyrillic set, 10646 follows the arrangement of 8859/5. But 10646 also contains dozens of baseform-diacritic combinations, which in Unicode are represented by character sequences, not single characters.

Diacritical marks: In the Unicode design, diacritical marks are spelled as separate characters occurring after the baseform character in text sequence. In general, Unicode regards baseform-diacritic combinations as sequences represented via composition, which do not receive separate codes. However, all of the baseform-diacritic combinations in 8859/5 are retained for compatibility. Furthermore, letterforms that might be considered as baseform-diacritic combinations but where the mark appears integral to the body of the letter are encoded as independent characters, in order to avoid dispute over whether these letters have marks or protrusions. The majority of the Extended Cyrillic characters fall into this category. Also, a few idiosyncratic combinations used in archaic Cyrillic are encoded whole because the diacritics are not productive.

The only inseparable diacritical marks unique to Cyrillic are for Extended Cyrillic, and these are subject to wide typographic variability. In particular, there is a generic protrusion of the lower-right corner of a letter, which apparently originated as a generalization of the addition to U+0448 "sha ш" that produces U+0449 "shcha щ". This entity appears in many different graphic renditions; the ISO standard character names refer to it erroneously as "CEDILLA".
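The "same relative positions" statement for ISO 8859/5 above amounts to a constant offset, which a minimal Python sketch can make concrete. Several things here are assumptions rather than content of this draft: the identity of the four unified punctuation codes is taken from present-day mapping tables, the comparison relies on Python's "iso8859_5" codec, and the names OFFSET, UNIFIED, and cyrillic_to_unicode are hypothetical.

```python
# Offset that carries 8859/5 column A onto Unicode column U+040.
OFFSET = 0x0400 - 0xA0

# Assumed set of the four generic codes unified with other Unicode ranges:
# no-break space, soft hyphen, numero sign, section sign.
UNIFIED = {0xA0: 0x00A0, 0xAD: 0x00AD, 0xF0: 0x2116, 0xFD: 0x00A7}


def cyrillic_to_unicode(byte: int) -> int:
    """Map one ISO 8859/5 byte to a Unicode code value."""
    if byte < 0xA0:
        return byte  # ASCII and control codes pass through unchanged
    return UNIFIED.get(byte, byte + OFFSET)


for byte in (0xB0, 0xE8, 0xE9, 0xF0):
    print(f"0x{byte:02X} -> U+{cyrillic_to_unicode(byte):04X}")
```

Under these assumptions 0xE8 and 0xE9 land on U+0448 and U+0449, the "sha" and "shcha" codes cited above.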
Unifications: The "Cyrillic" block of Unicode contains letters of various origin, most of them clearly from Greek, a few from Hebrew (U+0448 "sha #" from U+05E9"shin #"), and some misleading (U+0455 Old Cyrillic zelo "S" not obviously from U+0073 Latin "S"). To avoid unnecessary chaos, Unicode regards all these letters as having established separate Cyrillic identities for themselves over the many centuries. In contrast, the recently-created alphabets including "Extended Cyrillic" characters for Soviet minority languages are very far from well- established. Latin characters included in those alphabets (e.g. "q" and "w" for Kurdish, or U+0292 # "yogh" for Abkhasian) are not given unique Cyrillic encodings. Languages: The language(s) using a given character are noted in cases where this information was thought to be helpful (such annotation is given only after the lowercase form, to avoid needless repetition). If such an annotation ends with an ellipsis "...", then the language(s) cited are merely the principal one(s) among many. If the annotation does not end with an ellipsis, then the cited list is thought to be complete. Glagolitic: Glagolitic is a script originally related to Cyrillic, but the history of the creation of the scripts and their relationship has been lost. Unicode regards Glagolitic as a separate script from Cyrillic, not as a font change from Cyrillic. This is primarily because Glagolitic appears unrecognizably different from Cyrillic, and secondarily because Glagolitic has not grown to match the expansion of Cyrillic. Since Glagolitic is essentially extinct, it is not encoded in the current draft of Unicode, but is expected to be in the future. 
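The relationship to ISO 8859/5 noted under Standards can be checked mechanically. The sketch below assumes Python's built-in iso8859_5 codec, whose table reflects later published Unicode and not this draft, and it assumes a constant offset that lines 8859/5 position 0xA0 up with U+0400. The names OFFSET and exceptions are hypothetical. It lists the 8859/5 positions that do not follow the offset; with current tables these are four non-letter characters.

```python
OFFSET = 0x0400 - 0xA0  # assumption: 8859/5 position 0xA0 corresponds to U+0400


def exceptions() -> dict[int, str]:
    """Collect 8859/5 positions 0xA0-0xFF whose Unicode value is not byte + OFFSET."""
    found = {}
    for byte in range(0xA0, 0x100):
        ch = bytes([byte]).decode("iso8859_5")
        if ord(ch) != byte + OFFSET:
            found[byte] = ch
    return found


for byte, ch in exceptions().items():
    print(f"0x{byte:02X} -> U+{ord(ch):04X}")
# 0xA0 -> U+00A0
# 0xAD -> U+00AD
# 0xF0 -> U+2116
# 0xFD -> U+00A7
```

Every other position in the upper half of 8859/5 lands at a fixed distance from its Unicode value, so conversion in either direction needs only an addition plus a four-entry exception table.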
Encoding structure: The Unicodes for the Cyrillic script are divided into two adjacent blocks "Cyrillic" and "Extended Cyrillic", which have the following ranges: U+0400-045F: Mapping of the standard 8859/5 U+0460-0481: Archaic letters U+0482-048F: Archaic miscellaneous U+0490-04C0: Extended Cyrillic U+04C1-04FF: Currently unassigned Archaic letters: The archaic form of the Cyrillic alphabet is regarded as a font change from modern Cyrillic, because the archaic forms are relatively close to the modern appearance and because some of them are still in modern use in languages other than Russian (e.g.,U+0406 Old Cyrillic "I" used in modern Ukrainian and Byelorussian). Since the archaic letters outside of 8859/5, i.e. those in columns U+046 through U+048, rarely occur in modern form, those letters are shown in the charts in an archaic font. A complete Old Cyrillic set would be obtained by rendering the whole "Cyrillic" section, i.e., columns U+040 through U+048, in that same style. Extended Cyrillic U+0490-04FF Extended Cyrillic: These are the baseforms used in alphabets for minority languages of the Soviet Union. The order of these letters follows 2nd DP ISO 10646 and is based (very crudely) on graphic similarity to Russian letters, not on phonetic values. Note that the scripts of some Soviet minority languages have often been revised in the past; Unicode includes only the alphabets in current use, not the rejected old letterforms. Author accepting responsibility is :_____________________ Georgian U+0500-052F The Georgian script is used primarily for writing the Georgian language. The script is very well-behaved, lacking even diacritical marks and uppercase/lowercase pairs. Archaic script form: The modern Georgian script is a style called MKHEDRULI (soldier's), which originated as the secular derivative of a form called KHUTSURI (ecclesiastical) that did have uppercase/lowercase pairs. 
Since KHUTSURI is essentially extinct, it is not encoded in the current draft of Unicode, but it may be in the future. Standards: 2nd DP ISO 10646: Unicode departs from the 10646 arrangement for Georgian. In Unicode, the archaic letters are placed together in a group after the modern letters. In 10646, these two groups of letters are sorted together (in an order that is open to question). Encoding structure: The Unicode block for the Georgian script is divided into the following ranges: U+0500-0520: Modern alphabet U+0521-0526: Archaic letters U+0527-052A: Currently unassigned U+052B: Punctuation U+052C-052F: Currently unassigned Author accepting responsibility is :_____________________ Armenian U+0530-058F The Armenian script is used primarily for writing the Armenian language. The script is very well-behaved, lacking even diacritical marks (although see below). It does have uppercase/lowercase pairs. Standards: 2nd DP ISO 10646: Unicode follows the 10646 arrangement for Armenian. Based on general policies, Unicode omits two digraphs and a ligature found in 10646. The character that 10646 encodes as GEORGIAN FULL STOP is encoded in Unicode as ARMENIAN FULL STOP, since its modern usage is more common in Armenian than in Georgian. Encoding structure: The Unicode block for the Armenian script is divided into the following ranges: U+0530: Currently unassigned U+0531-0556: Uppercase letters U+0557-0558: Currently unassigned U+0559-055F: Modifier letters U+0560: Currently unassigned U+0561-0586: Lowercase letters U+0587-0588: Currently unassigned U+0589: Punctuation U+058A-058F: Currently unassigned Modifier letters: The small marks in the group called Armenian modifier letters are sometimes said to be placed "above" the alphabetic letters of the words to which they apply, but in modern Armenian typography they are quite uniformly placed above and to the right, so that they actually occupy a letter position of their own.
Therefore, in Unicode these objects are treated as spacing letters rather than as non-spacing diacritical marks. Author accepting responsibility is :_____Joe Becker________________ Hebrew U+0590-05FF The Hebrew script is used for writing the Hebrew language, and also Yiddish and Ladino. Vowels and various other marks are written as "points" applied to consonantal base letters; in normal writing these points are omitted. The script is written from right to left (the only other right-to-left script currently encoded in Unicode is Arabic). Final (contextual variant) letterforms: Variant forms of five Hebrew letters are encoded as separate characters in all Hebrew standards, therefore this practice is followed in the Unicode standard. Right-to-left directionality: The means of indicating right-to-left text directionality is still a hotly-debated topic (see separate discussion), but this debate has little effect on the selection and designation of the characters themselves. In fact, there appears to be widespread agreement on the only substantive encoding correlate of directionality: The punctuation marks used with the Hebrew script are not given independent codes (i.e., are unified with Latin punctuation), except for the few marks that are unique to Hebrew. Standards: ISO 8859/8: Unicode encodes the Hebrew alphabetic characters in the same relative positions as in 8859/8; however, there are no points or Hebrew punctuation characters in this standard. 2nd DP ISO 10646: Unicode follows the basic arrangement of 10646, as modified by the comments on 10646 supplied by the Standards Institution of Israel. 
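The statement that the Hebrew letters keep their 8859/8 relative positions amounts to a single constant offset. The sketch below is an illustration under assumptions: it uses Python's built-in iso8859_8 codec and the unicodedata module, whose tables reflect later published Unicode and not this draft, and the names HEBREW_OFFSET and to_unicode are hypothetical. It also counts the separately encoded final letterforms mentioned above.

```python
import unicodedata

# Assumption: alef sits at 8859/8 position 0xE0 and at U+05D0.
HEBREW_OFFSET = 0x05D0 - 0xE0


def to_unicode(byte: int) -> str:
    """Convert one 8859/8 Hebrew letter position (0xE0-0xFA) by adding the offset."""
    if not 0xE0 <= byte <= 0xFA:
        raise ValueError("not a Hebrew letter position in 8859/8")
    return chr(byte + HEBREW_OFFSET)


# The offset rule agrees with the codec table for every letter position.
agrees = all(to_unicode(b) == bytes([b]).decode("iso8859_8") for b in range(0xE0, 0xFB))

# Final (contextual variant) letterforms carry FINAL in their character names.
finals = [to_unicode(b) for b in range(0xE0, 0xFB)
          if "FINAL" in unicodedata.name(to_unicode(b))]

print(agrees, len(finals))  # True 5
```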
Encoding structure: The Unicode block for the Hebrew script is divided into the following ranges: U+0590-05AF: Cantillation marks, accents U+05B0-05CF: Points and punctuation U+05D0-05EF: Mapping of ISO 8859/8 U+05F0-05F2: Yiddish digraphs U+05F3-05F4: Additional punctuation U+05F5: Additional point U+05F6-05FF: Currently unassigned Points and cantillation accents: These marks, generically called "points", indicate vowels or other modifications of consonant letters. The occurrence of a character in the "Cantillation accents" or "Points and punctuation" range, depicted with relation to a dashed circle, constitutes an assertion that this character is intended to be applied via some process to the character that precedes it in the text stream, this being called the "base character". These marks may therefore be called "non-spacing" or "floating" or "flying". When rendered, these characters are intended not to occupy a spacing position by themselves. By convention, such marks may be exhibited in (apparent) isolation by applying them to the SPACE character U+0020. Unicode does not specify a sequence order in case of multiple marks applied to the same base character, since there is no possible ambiguity of interpretation. Cantillation accents: These marks are used to indicate chanting of sacred texts. There are several systems of such accents; current standards encode the Tiberian system. The literature contains great variability in the relationship between the names of these accents and their graphic forms. Points and punctuation: A few of these marks are placed "after" (to the left of) their base characters. In these cases Unicode treats them as ordinary spacing characters. Author accepting responsibility is :_____Joe Becker________________ Arabic/Extended Arabic U+0600-06FF The Arabic script is used for writing the Arabic language, and has been extended for representing a number of other languages both major and minor: Persian, Urdu, Pashto, Sindhi, Kurdish, etc.
Some languages which formerly used the Arabic script now employ the Latin or Cyrillic scripts: Indonesian/Malay, Turkish, Ingush, etc. The Arabic script is cursive even in its printed form, so that as in the handwritten tradition, the same letter may be written in many different forms depending on how it joins with its neighbors. Vowels and various other marks are written as "points" applied to consonantal base letters; in normal writing these points are omitted. The script is written from right to left (the only other right-to-left script currently encoded in Unicode is Hebrew). Semantic encoding: The basic Arabic alphabet is relatively well-defined (at least, the basic consonants), and each letter receives only one Unicode value, no matter how many different contextual appearances it may exhibit in text. Each Unicode may be said to represent the abstract character itself, or the inherent semantic identity of the letter. A word is spelled as a sequence of abstract letters, i.e. as a sequence of Unicodes. The task of converting such a spelling to a visual form, and the graphic fragments used to compose such a visual form, are matters external to character encoding. The graphic form shown in the Unicode chart for an Arabic letter (usually the form of the letter when standing by itself) is not the identity of that Unicode, but rather a mere reminder of the abstract letter it represents. Right-to-left directionality: The means of indicating right-to-left text directionality is still a hotly-debated topic (see separate discussion), but this debate has little effect on the selection and designation of the characters themselves. In fact, there appears to be widespread agreement on the only substantive encoding correlate of directionality: The punctuation marks used with the Arabic script are not given independent codes (i.e. 
are unified with Latin punctuation), except for the few cases where the mark has a significantly different appearance in Arabic, namely: U+060C # comma, U+061B # semicolon, U+061F # question mark, U+066A # percent sign. Standards: ISO 8859/6 = ECMA-114 = ASMO 449: There is a relatively well-established standard encoding for Arabic; Unicode therefore places the basic Arabic characters in the same relative positions as in this standard. This Arabic standard order is worth adhering to despite foibles such as the remarkable gap this leaves in the alphabet (U+063B-0640) and the omission of all "extended" Arabic letters needed for other languages in this family. 2nd DP ISO 10646: 10646 follows the arrangement of 8859/6 for the basic Arabic characters. It also contains other Arabic forms scattered with no obvious logic into three different areas: extended Arabic letters, digits, and "presentation forms". Unicode includes the "extended" letters and digits because they are needed for other languages in the family, but, as a rule, does not encode Arabic "presentation forms" because they are not characters. Encoding structure: Unicodes for Arabic scripts are divided into two adjacent blocks "Arabic" and "Extended Arabic", which have the following ranges: U+0600-064A: Basic Arabic characters as mapped in ISO 8859/6 U+064B-065F: Points from 8859/6 U+0660-066F: Extended Arabic: "Indic" digits U+0670: Extended Arabic: Additional point U+0671-06D4: Extended Arabic letters U+06D5-06FF: Currently unassigned Points: Points are marks that indicate vowels or other modifications of consonant letters. The occurrence of a character in the "Points" range, and its depiction with relation to a dashed circle, constitute an assertion that this character is intended to be applied via some process to the character that precedes it in the text stream, this being called the "base character". These marks may therefore be called "non-spacing" or "floating" or "flying".
When rendered, these characters are intended not to occupy a spacing position by themselves. By convention, such marks may be exhibited in (apparent) isolation by applying them to the SPACE character U+0020. Unicode does not specify a sequence order in case of multiple marks applied to the same Arabic base character, since there is no possible ambiguity of interpretation. "Indic" digits: The "Indic" digits are those used in conjunction with the Arabic script (the term "Indic" is used to avoid the ambiguity of the term "Arabic digits"). Unicode assigns separate codes to the digits of each script, just as it does to the letters of each script. The Persian and Urdu variant digits are given separate codes under the principle of "glyphic coding," discussed below. Extended Arabic letters: The set of letters encoded in this section unavoidably contains spurious forms. The Arabic script has been extended for some relatively obscure languages (e.g. Baluchi, Lahnda) which have little tradition in printed typography. Although the available information on variant handwritten forms is sporadic and inconsistent, it is clear that in many cases the extended letters for obscure languages overlap with the well-defined character extensions used by major languages like Persian (Farsi) and Pashto. In this situation of imperfect information, Unicode adopts a "glyphic" approach to the baseform letters and variant digits in the Extended Arabic block. There are often different characters for the same "semantic" (or sound), and different "semantics" (or sounds) attributed to the same characters by different languages. The best we can do is to supply a superset of the various characters to choose from; codes that are not needed (and/or regarded as invalid) should simply be ignored. Given imperfect information and the risk of omitting valid characters, this approach was felt to be the most practical. 
Within this framework, however, the graphic form shown in the Unicode chart for an Extended Arabic letter remains merely the stand-alone form of the abstract letter, just as in the chart of the basic Arabic alphabet. The names given to extended Arabic characters are entirely artificial, intended only to create unique identifiers. The language(s) using a given character are indicated, even though this information is incomplete. When such an annotation ends with an ellipsis "...", then the languages cited are merely the known principal ones among many. Plurals in Farsi Subject: Re: Arabic languages - Algorithmic shaping Cc: fortran@ibm.com, khan@btc.kodak.com >>If this is the issue referred to, then the only problem is >>determining whether in Farsi the correct typography would be to >>separate off the plural suffix with a normal space in rendering or >>with a thinspace. The contextual shaping of the individual glyphs >>is otherwise perfectly regular. It is a separate issue to determine >>the correct and expected UI for entering Farsi which has this >>typographical behavior. The solution suggested above for handling the plural suffix is an acceptable solution. The plural suffix is often written after the word with very little space in between the two. The use of a thin space to separate the two would be okay, and should be preferred because it keeps the context analysis algorithm perfectly regular. Furthermore, such separation between the plural suffix and the word is not a universal practice. There are many instances when the plural is joined with the rest of the word using regular joining rules. An example would be the plural for the word "shub" which means night, and whose plural is "shub-ha" which can be, and is, written both ways. Thus such usage is more correctly a typographic refinement and should be user selectable through the appropriate UI with provisions for entry of various types of spacing elements.
The use of thin spaces is also needed because in some cases one wants to force isolated forms of the characters in a word, and this is the only way to do it correctly. Author accepting responsibility is :______Joe Becker_______________ Ethiopian U+0700-081F The Ethiopian script is used for writing several languages of the area, including Amharic, Tigre, and Oromo. The script, which is based on the writing of a dead language Ge'ez, is graphically well-behaved. However, it is a syllabary rather than an alphabet, which has several encoding consequences discussed below. Array structure: The basic Ge'ez syllabary is traditionally arranged as an array of 33 consonant initials crossed with 7 vowel finals. Since most of the consonants also take a labialized final, this can be expanded to a 33 x 8 array, which is ideal for encoding. This orderly array forms the basis for the Unicode "Ethiopian" block; other characters are added afterward in a less systematic fashion. Standards: 2nd DP ISO 10646: The 10646 arrangement for Ethiopian is also derived originally from the 33 x 8 syllabic array, but in 10646 this array is destroyed by the impossibility of forcing it into a "graphic character set" structure of 94 codes. Encoding structure: The Unicodes for the Ethiopian script are divided into two adjacent blocks "Ethiopian" and "Extended Ethiopian", which have the following ranges: U+0700-0807: Basic Ge'ez syllabary U+0808-081B: Numbers U+081C-081F: Punctuation U+0820-082F: Variant letters U+0830-0832: Additional punctuation U+0833-083F: Diacritical marks U+0840-089E: Additional letters U+089F-08FF: Currently unassigned Variant letters: These are common but unsystematic variants of letters in the syllabic array. Diacritical marks: The Ethiopian syllabic letterforms in most cases reveal their origin as composites of a consonant base character plus a vowel diacritical mark, with labialization represented by a further diacritical mark. 
In Unicode the syllabic letters are represented as whole codes, rather than by composition, because the composites have truly become the units of the script (and besides, the compositional rules are very irregular). However, a syllabary is more difficult to extend than an alphabet, and there may be merit in accomplishing some extensions via the application of diacritical marks. The few marks in this range appear to be the most productive in producing extensions, and are provided in case there is a desire to use them in this fashion. Extended Ethiopian U+0820-08FF Extended Ethiopian letters: This group includes some extensions of the basic syllabary, plus a set of labialized series that is now part of the standard script (and which in some cases replicates syllables in the main array). The characters are arranged according to the same N x 8 scheme as the main array. The names given to the extended Ethiopian characters are somewhat artificial, intended mainly to create a unique identifier. The Ethiopian script has been extended for some relatively obscure languages which may have little tradition of printed typography, and obsolete alternative forms of some letters also exist. The available information on variant letter forms is often sporadic and inconsistent, so some of the codes may be regarded as unneeded (and/or invalid) for some applications. It is assumed that the encoding of various languages will make use of various different subsets of these extensions. Given the imperfection of information and the bulkiness of extensions to a syllabary, the currently unassigned range has been made larger for Ethiopian than for other scripts (enough singly-attested forms have already been collected to fill it). Author accepting responsibility is :_____________________ Devanagari U+0900-097F Block introduction not yet written. Author accepting responsibility is :_____________________ Bengali U+0980-09FF Block introduction not yet written. 
Author accepting responsibility is :_____________________ Gurmukhi U+0A00-0A7F Block introduction not yet written. Author accepting responsibility is :_____________________ Gujarati U+0A80-0AFF Block introduction not yet written. Author accepting responsibility is :_____________________ Oriya U+0B00-0B7F Block introduction not yet written. Author accepting responsibility is :_____________________ Tamil U+0B80-0BFF Block introduction not yet written. Author accepting responsibility is :_____________________ Telegu U+0C00-0C7F Block introduction not yet written. Author accepting responsibility is :_____________________ Kannada U+0C80-0CFF Block introduction not yet written. Author accepting responsibility is :_____________________ Malayalam U+0D00-0D7F Block introduction not yet written. Author accepting responsibility is :_____________________ Sinhalese U+0D80-0DFF Block introduction not yet written. Author accepting responsibility is :_____________________ Thai U+0E00-0E7F Block introduction not yet written. Author accepting responsibility is :_____________________ Lao U+0E80-0EFF Block introduction not yet written. Author accepting responsibility is :_____________________ Burmese U+0F00-0F7F Block introduction not yet written. Author accepting responsibility is :_____________________ Khmer U+0F80-0FFF Block introduction not yet written. Author accepting responsibility is :_____________________ Tibetan U+1000-107F Block introduction not yet written. Author accepting responsibility is :_____________________ Mongolian U+1080-10FF (to be defined) Block introduction not yet written. Author accepting responsibility is :_____________________ General Punctuation U+2000-206F General punctuation combines punctuation characters and character like elements used to achieve certain text layout effects. The former contain punctuation which can be used with many different scripts. Many general punctuation characters can also be found in the Unicode ASCII and Latin1 blocks. 
Punctuation felt to belong to a specific script is found in the block corresponding to that script, e.g. the Greek question mark U+03D7 ";" or the punctuation used with ideographs in the CJK Symbols block. For decimal points and thousands separators, several encodings were supplied to provide applications with the ability to encode these either glyphically or semantically depending on their processing needs. (Latest version revised this, but I don't have the details yet./ed) !! NOTE: Standards mention is tentative Encoding structure: The Unicode block for the general punctuation is divided into the following ranges: U+2000-200A: Typographical space characters U+200B-200F: Zero width layout characters U+2010-203E: Printing punctuation characters U+203F-206F: Currently unassigned Typographical space characters: These are encoded glyphically and allow fine control over the width of the space character. Zero width layout characters: Occasionally it is desirable to indicate to software formatting text that adjacent characters do or do not run together, or in the case of mixed left-to-right and right-to-left nested text runs to disambiguate the direction of characters that do not carry an intrinsic directionality. For this purpose Unicode provides zero width layout characters. The Zero width space U+200B acts just like any other space character, except that it has zero width. The non-joiner U+200C, if placed between e.g. f and i, would prohibit the use of the "fi" ligature by the formatting software. The joiner U+200D has the opposite effect. The left-to-right marker U+200E and the right-to-left marker U+200F can be used to override the formatting software's default decision about the directionality of a given character or text-run by providing a non-printing character of a given directionality.
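The five zero width layout characters named above occupy U+200B through U+200F. As a minimal sketch, assuming those code values and a hypothetical helper named strip_layout, software whose task does not concern layout (comparison or searching, for example) could remove them before looking at the text:

```python
# Code values taken from the range list above: U+200B through U+200F.
# dict.fromkeys maps each code to None, which str.translate treats as "delete".
ZERO_WIDTH_LAYOUT = dict.fromkeys(range(0x200B, 0x2010))


def strip_layout(text: str) -> str:
    """Return text with the zero width layout characters removed."""
    return text.translate(ZERO_WIDTH_LAYOUT)


plain = "fi"
marked = "f\u200ci"  # non-joiner placed between f and i

print(marked == plain)                # False
print(strip_layout(marked) == plain)  # True
```

Because the characters carry no mode or state, removing them one at a time is sufficient; nothing needs to be tracked across the rest of the string.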
Except for their effect on the layout of the text in which they are contained, these zero width layout characters can be treated just as any other character by the processing software; in particular they do not introduce a mode or state into the character sequence. For non-layout text processing, such as sorting, searching etc. they can simply be filtered out. Author accepting responsibility is :____Ken Whistler_________________ Superscripts and Subscripts U+2070-209F !! NOTE: Standards mention is tentative Encoding structure: The Unicode block for superscripts and subscripts is divided into the following ranges: U+2070-2070: Superscript 0 U+2071-2073: Reserved U+2074-207F: Superscripts U+2080-208E: Subscripts U+208F-209F: Currently unassigned Author accepting responsibility is :____Ken Whistler_________________ Currency U+20A0-20CF !! NOTE: Standards mention is tentative This block contains currency symbols. Other currency symbols are encoded in the ASCII and Latin1 blocks. Encoding structure: The Unicode block for currency is divided into the following ranges: U+20A0-20A9: Currency Symbols U+20AA-20CF: Currently unassigned Author accepting responsibility is :_____________________ Diacritics U+20D0-20FF !! NOTE: Standards mention is tentative Encoding structure: The Unicode block for diacritics is divided into the following ranges: U+20D0-20E1: U+20E1-20FF: Currently unassigned Author accepting responsibility is :____Ken Whistler_________________ Letterlike Symbols U+2100-214F !! NOTE: Standards mention is tentative Encoding structure: The Unicode block for letterlike symbols is divided into the following ranges: U+2100-2129: Letterlike symbols U+212A-214F: Currently unassigned Author accepting responsibility is :____Ken Whistler_________________ Number Forms U+2150-218F !!
NOTE: Standards mention is tentative Encoding structure: The Unicode block for the number forms is divided into the following ranges: U+2150-2152: Overstruck forms of digits U+2153-215F: Vulgar fractions U+2160-2182: Roman numerals and small roman numerals U+2183-218F: Currently unassigned Author accepting responsibility is :____Ken Whistler_________________ U+2190-21FF Arrows !! NOTE: Standards mention is tentative Glyphic encoding: Because the arrows have such a wide variety of applications, the encoding of this block is intentionally "glyphic" rather than "semantic". Thus there may be several semantics for the same Unicode, e.g., U+2185 " " downward left arrow = carriage return. And there are several essentially stylistic variants for each of the basic arrow forms. Encoding structure: The Unicode block for arrows is divided into the following ranges: U+2190-21EA: Arrows U+21EB-21FF: Currently unassigned Author accepting responsibility is :_____________________ Mathematical Operators U+2200-22FF !! NOTE: Standards mention is tentative Mathematical operators are also found in the ASCII and Latin1 blocks. In addition, symbols from the miscellaneous technical block, and characters from general punctuation are also often used for mathematical notation. Mathematical operators such as "implies" and "if and only if" have been unified with the corresponding arrows in the arrows block (U+21D2, U+21D4). Latin letters in special font styles, such as script P for the Weierstrass elliptic function U+2118, are to be found in the block letterlike symbols. There are two Greek letters used for semantic units which are not part of the Greek block. These are "micro" U+00B5 "m" in block Latin1 and the "Ohm sign" U+2126 "W" in Letterlike symbols. All other Greek characters with special mathematical semantics have been unified with the Greek characters in the Greek block because their mathematical semantics do not distinguish them substantially from Greek letters.
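To make the two exceptions concrete, the sketch below relates each unit symbol to its Greek letter. It is an illustration under assumptions: it relies on the compatibility normalization (NFKC) in Python's unicodedata module, whose data reflects later published Unicode and not this draft, and the name greek_counterpart is hypothetical.

```python
import unicodedata

# The two unit symbols that live outside the Greek block, per the text above.
SPECIAL = {
    0x00B5: "micro (Latin1 block)",
    0x2126: "Ohm sign (Letterlike Symbols)",
}


def greek_counterpart(code: int) -> str:
    """Fold a unit symbol to the Greek letter it is derived from, using NFKC."""
    return unicodedata.normalize("NFKC", chr(code))


for code, label in SPECIAL.items():
    folded = greek_counterpart(code)
    print(f"U+{code:04X} {label} -> U+{ord(folded):04X} {unicodedata.name(folded)}")
# U+00B5 micro (Latin1 block) -> U+03BC GREEK SMALL LETTER MU
# U+2126 Ohm sign (Letterlike Symbols) -> U+03A9 GREEK CAPITAL LETTER OMEGA
```

A search routine that wants "micro" to match a Greek mu could apply this folding to both the pattern and the text before comparing.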
Glyphic encoding: Because mathematical operators have such a wide variety of applications, the encoding of this block is intentionally "glyphic" rather than "semantic". There may be several semantics for the same Unicode, e.g. U+2218 circle bullet = composite function = APL jot. And there are several essentially stylistic variants for many operators, e.g., U+2208 = U+220B = U+228A; all encode "is an element of." Encoding structure: The Unicode block for the mathematical operators is divided into the following ranges: U+2200-22C3: Mathematical operators U+22C4-22FF: Currently unassigned Author accepting responsibility is :____Asmus Freytag_________________ Miscellaneous Technical U+2300-23FF !! NOTE: Standards mention is tentative Encoding structure: The Unicode block for miscellaneous technical symbols is divided into the following ranges: U+2300-2328: Miscellaneous technical symbols U+2329-23FF: Currently unassigned Author accepting responsibility is :____Asmus Freytag_________________ Control Pix U+2400-243F !! NOTE: Standards mention is tentative Encoding structure: The Unicode block for the control code pix is divided into the following ranges: U+2400-241F: Pictorial representation for control codes U+0000-001F U+2420-2423: Pictorial representations for "Space" and "Delete" U+2424-243F: Currently unassigned Author accepting responsibility is :_____________________ OCR U+2440-245F !! NOTE: Standards mention is tentative Encoding structure: The Unicode block for OCR is divided into the following ranges: U+2440-244A: OCR Symbols U+244B-245F: Currently unassigned Author accepting responsibility is :_____________________ Enclosed Alphanumerics U+2460-24FF !!
NOTE: Standards mention is tentative Encoding structure: The Unicode block for enclosed alphanumerics is divided into the following ranges: U+2460-2473: Encircled numbers 1-20 U+2474-2487: Parenthesized numbers 1-20 U+2488-249B: Numbers with period 1-20 U+249C-24B5: Parenthesized small Latin a-z U+24B6-24CF: Encircled capital Latin A-Z U+24D0-24E9: Encircled small Latin a-z U+24EA-24FF: Currently unassigned Author accepting responsibility is :_____________________ Form and Chart Components U+2500-257F Forms !! NOTE: Standards mention is tentative Encoding structure: The Unicode block for forms is divided into the following ranges: U+2500-254F: Single line box and line drawing elements U+2550-256C: Line box drawing elements with double line segments U+256D-2574: Miscellaneous U+2575-257F: Currently unassigned Blocks !! NOTE: Standards mention is tentative Encoding structure: The Unicode block for blocks is divided into the following ranges: U+2580-2593: Block and bar characters U+2594-259F: Currently unassigned Geometric Shapes !! NOTE: Standards mention is tentative Encoding structure: The Unicode block for Geometric Shapes is divided into the following ranges: U+25A0-25E5: Geometric shapes U+25E6-25FF: Currently unassigned Author accepting responsibility is :_____________________ Basic Dingbats & Miscellaneous U+2600-26FF Basic Dingbats & Miscellaneous !!
NOTE: Standards mention is tentative Encoding structure: The Unicode block for Basic Dingbats and Miscellaneous is divided into the following ranges: U+2600-2674: Basic Dingbats and Miscellaneous U+2675-26FF: Currently unassigned Author accepting responsibility is :_____________________ Chinese/Japanese/Korean Non-ideographic Symbols U+3000-33FF CJK Symbols and Punctuation U+3000-303F Standards: Based on 2nd DP ISO 10646 Encoding structure: The Unicode block for CJK Symbols and Punctuation is divided into the following ranges: U+3000-3031: CJK Current Symbols and Punctuation U+3032-303F: Currently unassigned Hiragana U+3040-309F Hiragana is the cursive syllabary used to phonetically write Japanese words, sentence particles and inflectional endings. Hiragana are commonly used as well to indicate the pronunciation of Japanese words. Hiragana are phonetically equivalent to corresponding Katakana syllables. Standards: the Unicode Hiragana block is based on the JIS X 0208-1983 standard, extended by the non-standard syllable U+3094 VU, which is included to accommodate 1:1 mapping between Katakana and Hiragana syllables. Encoding structure: The Unicode block for the Hiragana script is divided into the following ranges: U+3040-3093: Mapping of the JIS X 0208 standard U+3094: Variant form U+3095-309A: Currently unassigned U+309B-309C: Diacritical marks U+309D-309E: Punctuation-like characters U+309F: Currently unassigned Diacritical marks: Hiragana and the related script Katakana use the two diacritics encoded in this block to generate voiced and semi-voiced syllables from the base syllables. In the Unicode design, these diacritical marks follow the base character. Punctuation-like characters: These are the Hiragana specific iteration and voiced iteration marks. Katakana U+30A0-30FF Katakana is the syllabary used to phonetically write non-Japanese (usually Western) words. Katakana are commonly used as well to write Japanese words in order to create visual emphasis.
Katakana are phonetically equivalent to the corresponding Hiragana syllables.

Standards: The Unicode Katakana block is based on the JIS X 0208-1983 standard.

Encoding structure: The Unicode block for the Katakana script is divided into the following ranges:

    U+30A0-30F6: Mapping of the JIS X 0208 standard
    U+30F7-30FB: Currently unassigned
    U+30FC-30FE: Punctuation-like characters
    U+30FF: Currently unassigned

Punctuation-like characters: These are the Katakana conjunctive, the Hiragana/Katakana prolonged-syllable mark, the Katakana-specific iteration mark and the voiced iteration mark.

Author accepting responsibility is :_____Lee Collins_____

Zhuyinfuhao: Chinese Bopomofo Phonetic Symbols U+3100-312F

Standards: Based on the GB2312-80, Big-5, and CNS Standards

Encoding structure: The Unicode block for Bopomofo is divided into the following ranges:

    U+3100-312A: Mapping of GB2312-80, CNS, and IBM Big-5 Bopomofo Sections
    U+312B-312F: Currently unassigned

Author accepting responsibility is :____Jim Caldwell_________________

Hangul Elements: Basic Korean Phonetic Symbols U+3130-318F

Standards: Unicode follows KS C 5601-87 for Hangul elements.
Encoding structure: The Unicode block for Hangul elements is divided into the following ranges:

    U+3130-3163: Mapping of KS C 5601 standard: Modern Jamo elements
    U+3164-318E: Mapping of KS C 5601 standard: Archaic Jamo elements
    U+318F: Currently unassigned

Author accepting responsibility is :____Lee Collins______

More CJK Symbols U+3190-319F

Currently this block contains Unicodes for the four most recent Japanese eras:

    U+3190 # = Meiji era   1868 - 1912
    U+3191 # = Taishou era 1912 - 1926
    U+3192 # = Showa era   1926 - 1989
    U+3193 # = Heisei era  1989 -

Encoding structure: The Unicode block for more CJK symbols is divided into the following ranges:

    U+3190-3193: Japanese era names
    U+3194-31FF: Currently unassigned

Author accepting responsibility is :____Lee Collins______

CJK Parenthesized, Circled and Squared Abbreviations U+3200-33FF

CJK Parenthesized U+3200-325F

!! NOTE: Standards mention is tentative

Standards: The CJK Parenthesized block provides mappings for all the parenthesized Hangul elements from Korean standard KS C 5601, as well as parenthesized ideographic characters from the JIS ?? standard, CNS ????, and several corporate registries.

Encoding structure: The Unicode block for CJK Parenthesized is divided into the following ranges:

    U+3200-320D: Parenthesized Hangul elements
    U+320E-321F: Parenthesized Hangul syllables
    U+3220-323A: Parenthesized ideographs
    U+323B-325F: Currently unassigned

CJK Encircled U+3260-32FF

    U+3260-326D: Circled Hangul elements
    U+326E-327B: Circled Hangul syllables
    U+327C-327E: Currently unassigned
    U+327F: Korean Standard Symbol
    U+3280-32A8: Circled ideographs
    U+32A9-32CF: Currently unassigned
    U+32D0-32FE: Circled Katakana
    U+32FF: Japanese Industrial Standard symbol

Author accepting responsibility is :____Lee Collins_________________

CJK Squared Katakana Words and Latin Abbreviation Symbols U+3300-33FF

CJK squared Katakana words are words spelled in Katakana that fill a single character position when intermixed with ideographic Kanji characters.
The set of squared Katakana words and Latin abbreviation symbols is derived from various company registries.

Encoding structure: The Unicode block for CJK squared symbolic abbreviations is divided into the following ranges:

    U+3300-335A: Squared Symbolic Katakana Words
    U+335B-337F: Currently unassigned
    U+3380-33DD: Squared Latin Abbreviation Symbols
    U+33DE-33FF: Currently unassigned

Korean Hangul Syllables U+3400-

Korean Hangul Syllables

!! NOTE: Standards mention is tentative

Encoding structure: The Unicode block for Hangul syllables is divided into the following ranges:

    U+3190-3193:
    U+3194-31FF: Currently unassigned

Extended Korean Hangul Syllables

!! NOTE: Standards mention is tentative

Encoding structure: The Unicode block for extended Hangul syllables is divided into the following ranges:

    U+3190-3193:
    U+3194-31FF: Currently unassigned

Author accepting responsibility is :_____Lee Collins________________

Extended Korean Hangul Syllables (cont.) 3E

Author accepting responsibility is :____Lee Collins_________________

Chinese/Japanese/Korean Ideographs U+4000

From: Becker.OSBU_North@xerox.com
Subject: UniHan Levels
To: Unicode
Cc: davis.mark@applelink.apple.COM, liao@apple.com, Becker.OSBU_North@xerox.com
Message-Id: <"27-Sep-90 11:57:53 PDT".*.Joseph_D._Becker.OSBU_North@Xerox.com>

The proposed content of the UniHan Levels appears to have stabilized:

------------------------------------------------------------------
Level I "Common" (roughly 10,500 characters)

  Major Standards
    All of GB 2312-80 "G0"                            ( 6,763)
    All of GB .......
"G1" ( 6,951) All of JIS X0208-1983 ( 6,353) All of KS C5601-1987 ( 4,888) Taiwan CNS 11643-86 / Big Five LEVEL 1 ( 5,401) Taiwan CNS 11643-86 / Big Five "symbols" ( 9) Taiwan CCCII "Common" Level ( 4,808) ------------------------------------------------------------------ Level II "Secondary" (roughly 8,500 characters) Major Standards Rest of JIS draft supplementary set ( 5,843) Rest of Taiwan CNS 11643-86 / Big Five LEVEL 2 ( 7,652) Rest of ANSI/ NISO Z39.64-1989 = EACC (13,481) Other Sources Rest of Xerox corporate collection Includes Telegraph Codes, Cantonese, etc. ( 9,776) ------------------------------------------------------------------ ------------------------------------------------------------------ Level III "Rare" (...) Major Standards Rest of Taiwan CNS proposed extensions ( 6,339) Rest of Taiwan CCCII "Next Freq" Level (17,032) Rest of GB 7589-87 "G2" ( 7,144) Rest of GB ....... "G3" ( 7,144) Rest of GB 7590-87 "G4" ( 6,956) Rest of GB ....... "G5" ( 6,956) Rest of other future national extensions ( ?) Other Sources XinHua News Agency additions ( 694) GB Korean "Yidu" row ( 94) Rest of Japanese corporate standards ( ?) Rest of Taiwan phone company name lists ( ?) Rest of selected fonts, dictionaries, etc. ( ?) ------------------------------------------------------------------ (The Xerox corporate collection is included in Level II because it represents years of research into characters which are useful but which are not included in national standards, e.g. characters specific to writing Cantonese.) UniHan Version 1.0 will consist of Levels I & II. Level I encompasses today's existing standards that are in the 6,500 character range. For pragmatic reasons, the remainder of existing standards in the 13,000 character range are placed in Level II. The content listed above for Level III is merely suggestive; requests for membership in Level III could accumulate for the rest of the century. 
This approach enables generic "Multilingual/International" systems to implement UniHan Level I, which would become the one fixed standard Han character set for covering all genuinely common CJK usage. At the same time, Level II would be available for producing full-functionality systems. Level III would eventually serve the needs of specialist applications.

Joe

Since each UniHan level is to be sorted in "radical/stroke order", that ordering needs to be precisely defined. The following are sketches toward making that definition.

THE RADICALS

The overriding goal is to make the minimal augmentation to the traditional KangXi system to be able to accommodate the PRC simplified characters. There is no attempt at all to make any innovative reform to the KangXi system. In particular, all traditional characters will receive a totally traditional treatment, so the only real problem is to define the treatment of simplified characters.

Thus, the UniHan radicals will consist of the 214 traditional KangXi radicals plus some number of PRC simplified radicals. The authorities taken for the PRC simplified radicals are the encoding standard GB2312-80 plus two authoritative dictionaries, Xin CiHai (XCH) and XianDai HanYu CiDian (XDHYCD).
Based on these, the proposed list of 22 PRC simplified radicals is as follows:

----------------------------------------
XDHYCD  Trad  Meaning
------  ----  -------
   27    149  speech
   59    184  food
   63    169  door
   64     90  bed
   76    187  horse
   77    120  silk
   83    178  leather (wei)
   91    159  vehicle
  102    154  cowry shell
  103    147  see
  116    182  wind
  137    212  dragon
  146    167  gold, metal
  152    196  bird
  171    181  page
  187    210  alike
  195    199  wheat
  203    197  salt
  210    213  tortoise
  219    211  tooth
  221    205  frog
  223    195  fish
----------------------------------------

Included are the 2 simplified forms of traditional radicals that are in XCH & XDHYCD but not in GB2312-80:

XDHYCD  Trad  Meaning
------  ----  -------
  187    210  alike
  210    213  tortoise

Excluded are all newly added PRC radicals that are not simplified versions of traditional radicals, in particular the 2 that are in GB2312-80:

GB2312  Sound  Meaning
------  -----  -------
  111   ye4    industry (simplified form)
  169   qi2    its (mo-ming-qi-miao de!)

Excluded are all revisions, reassignments, recombinations, and separated variants of the 214 traditional KangXi radicals, for example:

XDHYCD  Trad  Meaning
------  ----  -------
   46     64  hand (ti shou pang)
   65     85  water (san dian shui)

DETERMINING THE RADICAL OF A CHARACTER:

(1) If the character itself is a (Uni)Radical, it is assigned under itself

    Example:
    * The traditional character for "dragon" is assigned to traditional Radical 212
    * The Japanese simplified character for "dragon" is also assigned to traditional Radical 212 (since the Japanese themselves use this approach and not additional simplified radicals)
    * The Chinese simplified character for "dragon" is assigned to the simplified radical for "dragon" (and not to any graphical sub-fragment of it)

(2) If the character has a traditional KangXi radical (ala Dai KanWa JiTen, CCCII, etc.), use that

    This includes special cases:

    > All Japanese- and Korean-unique characters are assigned into the KangXi system as is done in their native dictionaries and in JIS standards

    > Traditional characters
      having traditional radicals that GB2312-80 & XCH & XDHYCD treat in innovative ways are to be treated in the traditional way (e.g. characters having san-dian-shui are mixed in at random with the other Radical 85'ers as is traditional)

    > Simplified characters (other than radicals themselves) which still contain the same radical as their traditional form are assigned to the traditional value of that radical

      Example: the simplified form of hu2 "(tea)pot" contains the same radical as the unsimplified form (traditional 33), but XCH & XDHYCD (not GB2312-80!) map Radical 33 to Radical 32; in UniHan the wholly traditional Radical 33 would be used

    > Simplified characters (other than radicals themselves) which no longer contain their traditional radical at all are assigned to the "new" radical given by GB2312-80 & XCH & XDHYCD that is actually a fragment of the simplified glyph, and not to the same radical as the unsimplified version of the character

      Example: the 3-stroke simplified form of wei4 "to protect" (as in weisheng or weibing) is assigned to traditional Radical 26, and not to traditional Radical 144 which is the radical of the unsimplified version of the same character

      Discussion: this convention seems to make more practical sense than assigning a character to a radical that is visually unrelated to its glyph

(3) If the character has one of the 22 PRC simplified versions of traditional radicals, use that

(4) Otherwise, in the rare cases not covered above, improvise

    Example: the simplified form of ye4 "industry" does not fall into categories (1)-(3); suggest assigning it to UniRad 1, for instance

THE ORDERING OF RADICALS

We have considered four possible schemes for ordering characters having PRC simplified radicals relative to characters having traditional KangXi radicals:

(1) Intersperse at the character level: characters having PRC radicals immediately follow their unsimplified counterparts

(2A) Intersperse at the group level such that the group of all characters
     having a given PRC radical immediately follows the group of all characters having that radical's unsimplified counterpart (e.g. the 2-stroke PRC simplified form of Radical 149 "speech" would immediately follow the 7-stroke Radical 149)

(2B) Intersperse at the group level based on the stroke count of the radical (e.g. the 2-stroke PRC simplified form of Radical 149 "speech" would be near the front)

(3) Segregate all PRC radicals to the end, so that the first 214 radicals are the traditional KangXi ones and then the PRC simplified radicals follow as numbers 215 through 236

Although no ordering is free from problems, we picked one of the above as the most appropriate for UniHan ... see if you can guess which ...

Joe

Draft already in Manual Editor is working with Author to revise

Author accepting responsibility is :____Lee Collins_________________

Private Use Area (Codes defined by Private Agreements) U+F000-FFFE

Author accepting responsibility is :_____________________

Compatibility Zone for IBM CodePages

Author accepting responsibility is :_____________________

Unicode Draft: Character Blocks and Block Introductions 9/27/90
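
Illustration (not part of the draft): the group-level ordering schemes (2A), (2B) and (3) listed under THE ORDERING OF RADICALS can be read as three different sort keys over the radical groups. The sketch below is one possible reading, not the scheme UniHan adopted. The Radical record, the key functions and the five-radical sample are all hypothetical names chosen for illustration. Two details are assumptions that the text does not settle: under (2B) a PRC radical is placed after the traditional radicals of the same stroke count, and under (3) the PRC radicals are ordered among themselves by the KangXi number of their unsimplified counterparts. Scheme (1) works at the character level and is not modelled here.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Radical:
    """Hypothetical record for one radical group (illustration only)."""
    name: str
    strokes: int
    kangxi: Optional[int] = None      # KangXi number, set for traditional radicals
    simplifies: Optional[int] = None  # KangXi number of the unsimplified form, set for PRC radicals

    @property
    def is_prc(self) -> bool:
        return self.kangxi is None

    @property
    def base(self) -> int:
        # KangXi number of the radical itself, or of its unsimplified counterpart.
        return self.simplifies if self.is_prc else self.kangxi


def key_2a(r: Radical):
    # (2A): a PRC radical group immediately follows its unsimplified counterpart.
    return (r.base, r.is_prc)


def key_2b(r: Radical):
    # (2B): position by the radical's own stroke count.
    # Assumption: PRC radicals come after traditional ones with equal strokes.
    return (r.strokes, r.is_prc, r.base)


def key_3(r: Radical):
    # (3): all traditional radicals first, PRC radicals segregated at the end.
    # Assumption: PRC radicals keep the relative order of their counterparts.
    return (r.is_prc, r.base)


# Small illustrative sample, deliberately out of order.
SAMPLE = [
    Radical("speech (PRC)", 2, simplifies=149),
    Radical("flute", 17, kangxi=214),
    Radical("speech", 7, kangxi=149),
    Radical("one", 1, kangxi=1),
    Radical("water", 4, kangxi=85),
]


def ordered(key):
    """Return the sample's radical names sorted under the given scheme."""
    return [r.name for r in sorted(SAMPLE, key=key)]


if __name__ == "__main__":
    for label, key in (("2A", key_2a), ("2B", key_2b), ("3", key_3)):
        print(label, ordered(key))
```

With this sample, the 2-stroke PRC form of "speech" lands directly after Radical 149 under (2A), near the front under (2B), and after Radical 214 under (3), which matches the examples given for each scheme.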