From portal!cup.portal.com!James_T_Caldwell@Sun.COM Tue Oct  9 08:25:05 1990
From: portal!cup.portal.com!James_T_Caldwell@Sun.COM
To: asumusf%microsoft@Sun.COM, u-core%noddy@Sun.COM
Subject: bloks
Date: Tue,  9 Oct 90 00:19:51 PDT

Unicore,
Here again is the blocks list for comment and revision.  It has bounced
twice, costing a week, from u-core@sun.com and u-core@noddy.sun.com.
Hope this works.  
CHARACTER BLOCKS AND BLOCK INTRODUCTIONS

The first 256 Unicodes may be considered as a group, although they fall into four distinct sub-
blocks:
	U+0000-001F:  	C0 ASCII control codes
	U+0020-007F:  	ASCII graphic characters
	U+0080-009F:  	C1 control codes
	U+00A0-00FF:  	ISO 8859/1 (aka Latin1)

Standards:  Unicode adapts the ISO standards for 7-bit and 8-bit characters by retaining the semantics and 
numeric values of these codes, merely supplying enough leading zeroes to convert them into 16-
bit numbers.  In terms of a 16-bit space, the content and arrangement of these standards is far 
from optimal, but Unicode retains them without change because of their prevalence in existing 
usage.
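
The zero-extension described above can be illustrated with a short Python sketch.  This is not part of the draft: the names SUB_BLOCKS and sub_block() are hypothetical, and the "latin-1" codec of a present-day Python is assumed to stand in for ISO 8859/1.  The sketch checks that each 8-bit value keeps its numeric value as a code, and sorts a code into the four sub-blocks listed earlier.

```python
# Illustrative sketch: SUB_BLOCKS and sub_block() are hypothetical names.
SUB_BLOCKS = [
    (0x0000, 0x001F, "C0 ASCII control codes"),
    (0x0020, 0x007F, "ASCII graphic characters"),
    (0x0080, 0x009F, "C1 control codes"),
    (0x00A0, 0x00FF, "ISO 8859/1 (aka Latin1)"),
]


def sub_block(code):
    """Return the sub-block label for a code in the range U+0000-00FF."""
    for first, last, label in SUB_BLOCKS:
        if first <= code <= last:
            return label
    raise ValueError("U+%04X is outside the first 256 Unicodes" % code)


# Decoding one ISO 8859/1 byte should give a code with the same numeric value.
same_values = all(
    ord(bytes([value]).decode("latin-1")) == value for value in range(256)
)

print("numeric values retained:", same_values)
for code in (0x0009, 0x0024, 0x0085, 0x00E9):
    print("U+%04X  %s" % (code, sub_block(code)))
```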

ISO 646:  The ISO character encoding standards are founded on 646, "7-bit coded 
character set for information processing interchange".  This provides an "international" 
set of interpretations for numeric values U+0000-007F, which are intended to be 
"localized" into national standard codes.

ANSI X3.4-1977:  This is ASCII: "American National Standard Code for Information 
Interchange".  ASCII is the version of ISO 646 "localized" into the national standard code 
for the USA.  In the few places where ISO 646 and ASCII may differ, Unicode gives 
priority to the "specific" interpretations of ASCII rather than to the "generic" 
interpretations of ISO 646.  (For example, at code U+0024, ISO 646 has the generic 
"international currency symbol", whereas in ASCII and Unicode this is localized to the 
dollar sign.)  The principle is to have explicit unambiguous character codes. This code is 
assigned to the American dollar sign because it is so in ASCII. Other currency symbols 
are likewise given their own code points within the appropriate blocks.

ISO 8859/1:  Also known as "Latin1", this extension is intended to supply a "most 
broadly useful" 8-bit complement to ISO 646, by providing additional letters extending 
the Latin alphabet to cover certain major languages of Europe (listed below).


ASCII (C0) Control Codes U+0000-001F 

C0 ASCII control codes:  The role of "control codes" in Unicode is discussed elsewhere.  Unicode 
makes no particular use of these control codes, but merely provides for the passage of the 
numeric code values intact, neither adding to nor subtracting from their semantics.


ASCII Characters U+0020-007F 

ASCII graphic characters:  (Technically, codes U+0020 SPACE and U+007F DELETE are control 
codes; the remaining 94 codes in this range are graphic characters.)  Some of the non-letter 
characters in this range suffer from overburdened usage as a result of the limited number of 
codes in a 7-bit space.  Some coding consequences of this are discussed below under "Semantic 
vs. glyphic encoding" and "Loose vs. precise semantics".  The rather haphazard ASCII collection 
of punctuation and mathematical signs is isolated from the larger body of Unicode 
punctuation, signs, and symbols (which are encoded in ranges starting at U+2000) only because 
the relative locations within ASCII are so widely used in standards and software.


Latin1 (C1) Control Codes U+0080-009F 

C1 control codes:  "Control codes" in the C1 range are assigned interpretations in various ISO 
standards, but do not have the force of long established usage as do those in the C0 range.  
Whatever the eventual assignments in the C1 range may be, Unicode makes no particular use of 
them; it merely provides for the passage of the numeric code values intact.

Latin1 Characters U+00A0-00FF 

ISO 8859/1 (aka Latin1):  Unicode specifies that combinations of a base letter plus a diacritical 
mark be coded out as two separate character codes.  However, because ISO 8859 serves 
those who desire to assign single codes for the most commonly used baseform-mark 
combinations, Unicode offers separate codepoints for these composed characters, treating them 
as if they were single characters.  Unfortunately, this engenders multiple spellings of single 
constructs, introducing ambiguity for users.  Therefore, even though these composite characters 
are included and can be used, pure Unicode implementations will code the diacritics separately.  
The languages that were formally targeted for coverage by extended Latin ISO 8859/1 (which 
supplements the Latin characters in ISO 646) are:

Danish, Dutch, English, Faeroese, Finnish, French, German, Icelandic, Irish, Italian, 
Norwegian, Portuguese, Spanish, and Swedish.

Many other languages can be written with this set of letters, including Hawaiian, 
Indonesian/Malay, and Swahili.  The characters within this group that have relatively limited 
use are annotated with the major language(s) employing them.  The characters in ISO 8859/2, 
8859/3, and 8859/4 (additional Extended Latin characters) are encoded in the following 
Unicode block.  Like ASCII, the Latin1 set includes a rather miscellaneous set of punctuation 
and mathematical signs.  Punctuation, signs, and symbols not included in ASCII and ISO 8859 
are encoded in Unicode addresses starting at U+2000.

"Diacritical mark" characters:  ASCII contains four codes which it treats as potential diacritical 
marks: U+005E, U+005F, U+0060, U+007E;   Latin1 contains five such codes: U+00A8, U+00AF, 
U+00B0, U+00B4, U+00B8.  In Unicode, these codepoints are unambiguously restricted to use as 
spacing characters; the corresponding non-spacing characters are coded elsewhere and cross-
referenced.
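
The spacing/non-spacing distinction can be inspected mechanically.  The sketch below is an illustration only: it relies on the character database shipped with a present-day Python (the unicodedata module), which reflects the published standard rather than this draft, and the pairings in PAIRS are an assumption made for the example, not the draft's cross-reference list.

```python
import unicodedata

# Spacing code from the text -> a non-spacing counterpart (assumed pairing,
# taken from the published standard rather than from this draft).
PAIRS = {
    "\u005E": "\u0302",  # circumflex
    "\u0060": "\u0300",  # grave
    "\u00A8": "\u0308",  # diaeresis
    "\u00B4": "\u0301",  # acute
    "\u00B8": "\u0327",  # cedilla
}


def is_non_spacing(ch):
    """True if the database classifies ch as a non-spacing mark (Mn)."""
    return unicodedata.category(ch) == "Mn"


for spacing, mark in PAIRS.items():
    print("U+%04X non-spacing=%s   U+%04X non-spacing=%s" % (
        ord(spacing), is_non_spacing(spacing),
        ord(mark), is_non_spacing(mark)))
```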

Semantic vs. glyphic encoding:  Because the numeric code values in this range are well-
established and widely used in various implementations, Unicode assigns minimal 
specifications on the typographic appearance of corresponding glyphs.  For example, the value 
ASCII 0024 has the semantic "dollar sign" in the US, leaving open the question of whether the 
dollar sign is to be rendered with one vertical stroke or two.  Thus, this Unicode value is taken 
to refer to the identity of the "dollar sign" semantic, not to its precise appearance.  Likewise, for the 
codes in this range that are indicated with alternative glyphs, the code is associated with the 
basic usage, and different systems are free to present the particular graphical form of their 
choice.

Loose vs. precise semantics:  Some ASCII characters have multiple uses, either through 
ambiguity in the original standards or through accumulated reinterpretations of a limited 
codeset.  For example, U+0027 is defined in ANSI X3.4-1977 as "Apostrophe (Closing Single 
Quotation Mark; Acute Accent)", and U+002D as "Hyphen (Minus)".  In general, Unicode 
intends merely to provide for the passage of these numeric code values intact, without adding to 
or subtracting from their semantics.  Unicode supplies unambiguous codes elsewhere for the 
most useful particular interpretations of these ASCII values, and the corresponding 
unambiguous characters are cross-referenced.  In a very few cases, Unicode indicates a 
preferred interpretation of an ASCII code, e.g. U+0027 is intended to be neutral (vertical).
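
As an illustration of loose versus precise codes, the following sketch prints the names that a present-day Python character database gives to the two ASCII codes discussed above and to some more specific alternatives.  The table LOOSE_TO_PRECISE is hypothetical, and the names come from the published standard, so they may not match the names list of this draft.

```python
import unicodedata

# Loosely defined ASCII code -> more precise codes (illustrative selection).
LOOSE_TO_PRECISE = {
    "\u0027": ["\u2019", "\u00B4"],  # closing single quote, acute accent
    "\u002D": ["\u2010", "\u2212"],  # hyphen, minus sign
}

for loose, precise in LOOSE_TO_PRECISE.items():
    print("U+%04X %s" % (ord(loose), unicodedata.name(loose)))
    for ch in precise:
        print("    U+%04X %s" % (ord(ch), unicodedata.name(ch)))
```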

Author accepting responsibility is

:___Joe Becker__________________





European Latin U+0100-017E 

Extended Latin U+0180-024F 

The "Extended Latin" block is provided as a grab-bag of letterforms used to extend the Latin 
script for non-European languages, phonetic symbols (other than the standard International 
Phonetic Alphabet symbols in the following block), or other special uses.

Standards:  This block covers, among other things, a registered standard for graphic characters 
used by African languages, ISO 6438 = German Standard DIN 31625, plus "Pinyin" Latin 
transcription characters in the People's Republic of China national standard GB 2312-80.

Encoding structure:  The Unicode block for Extended Latin is divided into the following ranges:
	U+0180-01C3:  	Extended Latin
	U+01C4-01CC:  	Croatian digraphs matching Serbian Cyrillic letters
	U+01CD-01D4:  	Pinyin diacritic-vowel combinations
	U+01D5-024F:  	Additions and currently unassigned

Extended Latin:  This group is merely a union of forms collected from a variety of different 
sources (the single greatest source is ISO 6438).  The forms are arranged in approximate Latin 
alphabetical order.  Upper/lower case pairs are placed together where possible, but in many 
cases the other case form is encoded at some rather distant location, and so is cross-referenced.   
The arrangement is not particularly defensible, but for different variations on the same base 
letter, the order is as follows: turned, inverted, hook attachment, stroke extension or 
modification, different style (e.g. script), small cap, modified basic form, ligature, Greek-
derived.  A small collection of marginally-Latin forms concludes this group.

Croatian digraphs matching Serbian Cyrillic letters:  Unicode generally avoids encoding 
digraphs and other multiple letterforms, but an exception is made for the case of Serbocroatian, 
which is a single language with paired alphabets in Latin script (Croatian) and Cyrillic script 
(Serbian).  In this unique case, direct one-to-one character transliteration is a reasonable ideal, 
and this set of digraph codes is provided for this purpose.  The appropriate cross-references are 
given for the lowercase letters.  One problem with digraph codes is that there are two potential 
uppercase forms, depending on whether only the initial letter is to be capitalized, or both (for 
the case of all-caps).  Unicode does not in itself aim to provide any solution of this problem for 
software that transliterates between Croatian and Serbian.
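
The two potential uppercase forms can be seen in a present-day Python, whose case mappings follow the published standard (an assumption relative to this draft, although the digraph codes fall in the same range).  The sketch below is illustrative only; the list DIGRAPHS is a hypothetical name.

```python
# Lowercase digraph codes in the range discussed above.
DIGRAPHS = ["\u01C6", "\u01C9", "\u01CC"]  # dz-with-caron, lj, nj

for lower in DIGRAPHS:
    all_caps = lower.upper()      # both letters capitalized
    initial_cap = lower.title()   # only the initial letter capitalized
    print("U+%04X -> all caps U+%04X, initial cap U+%04X" % (
        ord(lower), ord(all_caps), ord(initial_cap)))
```

Which of the two forms is wanted depends on the surrounding text, which is why the choice is left to the transliterating software.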

Pinyin diacritic-vowel combinations:  The PRC standard GB 2312-80 provides a set of codes for 
the "Pinyin" Latin transcription of Mandarin Chinese.  Most of the letters used in Pinyin 
romanization, even those with diacritical marks, are already covered in the preceding Latin 
blocks.  The rather exceptional group of eight codes here is provided in order to cover the 
remaining Pinyin combinations specified in GB 2312-80.

Additions and currently unassigned:  The unassigned space for this block is made unusually 
large, on the supposition that the Latin script is the most widely used in the world, and hence 
will be subject to the most extensions for various purposes in the future.

Case pairs:  A number of characters in this block are uppercase forms of characters whose 
lowercase form is included in some other grouping.  Most often, this occurs with characters that 
originated as members of the International Phonetic Alphabet which, when adopted into the 
Latin-based script of a real language, acquire a novel uppercase form.  Occasionally alternative 
uppercase forms arise by this process.  If usage information indicates that two uppercase forms 
are merely minor glyphic variants of the same form, they are given a single code, as for U+01B7 
LATIN CAPITAL LETTER YOGH.  If usage information indicates that two uppercase forms are 
actually used differentially, then they are given different codes, as for U+018E LATIN CAPITAL 
LETTER TURNED-E vs. U+018F LATIN CAPITAL LETTER SCHWA.  In the latter event, the lowercase 
form is cloned (U+01D5 LATIN SMALL LETTER TURNED E, clone of U+0259 LATIN SMALL LETTER 
SCHWA), so as to enable unique case-pair mappings if desired.
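
The effect of cloning the lowercase form can be checked with a short sketch.  It uses the case mappings of a present-day Python, which follow the published standard; there the lowercase turned e is found at U+01DD rather than at the code cited in this draft, so the sketch shows the principle and not the draft's assignments.

```python
# Two distinct uppercase forms, each with its own lowercase partner.
UPPER_FORMS = {
    "\u018E": "turned e",
    "\u018F": "schwa",
}

for upper, label in UPPER_FORMS.items():
    lower = upper.lower()
    # A unique case pair maps back to the uppercase form it started from.
    round_trip = lower.upper() == upper
    print("%-8s  U+%04X <-> U+%04X  unique pair: %s" % (
        label, ord(upper), ord(lower), round_trip))
```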

Languages:  Some indication of language or other usage is given for most characters, but this 
information is by no means to be regarded as exclusive. 

Author accepting responsibility is
:____Joe Becker_________________




Standard Phonetic U+0250-02AF 

The "Standard Phonetic" block contains primarily the unique symbols of the International 
Phonetic Alphabet (IPA), which is a standard system for indicating specific sounds.  The IPA 
was first propounded in 1886, and has undergone occasional revision of content and of usage 
since that time.  Unicode covers all single symbols and all non-ligature alternates in the last 
published IPA revision (1979).  The use of diacritical marks for close phonetic transcription is an 
integral part of IPA, as is the use of small "modifier letters" (the IPA diacritics and modifiers are 
encoded in the two blocks following this one).  A few symbols have been added to this block 
that are peculiar to IPA-derived transcriptional practices among Sinologists, Americanists, et al.  
Note also that a few non-standard or obsolete phonetic symbols are encoded in the block 
preceding this one (Extended Latin).

Unifications:  IPA includes the entire lowercase Latin alphabet a-z, a number of extended Latin 
letters (e.g. U+0153 LATIN SMALL LETTER COMBINATION O E), and a few Greek letters.  The question 
of whether these characters are "the same" when used in an IPA context, or whether all IPA 
forms should be considered as a separate unique alphabet, has many reasonable arguments on 
both sides.  Ultimately, Unicode was designed so as to unify the IPA symbols as much as 
possible with other letters (although not with non-letter symbols such as U+222B INTEGRAL SIGN 
or U+2299 DIRECT PRODUCT).  A primary reason, aside from reduced duplication, is that the IPA 
symbols have become adopted into Latin scripts for many languages (e.g. in Africa).  There 
seems to be no merit in the futile attempt to distinguish a "transcription" from an "actual script" 
in such cases.  The result is that several IPA symbols are found in ranges other than this block.  
Apart from the Latin alphabet, these are cross-referenced at the beginning of the names list.

IPA alternates:  In a few cases where standard IPA practice has evolved alternate forms, e.g., 
U+0269 SCRIPT-I "i" versus U+026A SMALL CAP-I "I", Unicode provides separate encodings for the 
two alternates.

Case pairs:  IPA does not sanction case distinctions, so in effect its phonetic symbols are all 
lowercase.  When IPA symbols are adopted into the "actual script" of a language, as for example 
has occurred in Africa, they acquire uppercase forms.  Since these uppercase forms are not 
themselves IPA symbols, they are encoded in the block preceding this one (Extended Latin), and 
cross-referenced with the IPA names list.

Typographic variants:  IPA includes typographic variants of certain Latin letters, which would 
ordinarily be considered variations of font style rather than of character identity, e.g. "script" or 
"small cap" letterforms.  These forms are encoded as separate characters in Unicode so that all of 
IPA may be encompassed within a single font.  Unicode also separately encodes the unique IPA 
typographic variant of the Greek letter "phi", as well as the borrowed Greek letter "iota" which 
has a unique Latin uppercase form.

Diacritical marks:  Unicode presumes the necessity of dynamically-applied (so-called "floating") 
diacritical marks, which happen to be an essential element of IPA orthography.  In Unicode, all 
diacritical marks are encoded in sequence after the base character to which they apply.  For more 
details, see the section on diacritical marks.

Standards:  2nd DP ISO 10646:  Unicode covers the phonetic characters contained in 2nd DP ISO 
10646 (which are taken from the Xerox Character Code Standard).  The Xerox/10646 set 
considers IPA forms to be a separate alphabet, so the Latin alphabet a-z and other symbols are 
duplicated there.  Although 10646 rejects the use of applied diacritical marks for Latin letters, it 
provides such marks for the equivalent letters in IPA. Unicode diacritics are for general 
application; Unicode does not duplicate the Latin alphabet a-z in IPA.

Encoding structure:  The Standard Phonetic block is arranged in approximate alphabetical order 
according to the Latin letter that is graphically most similar to each symbol.  This has nothing to 
do with a phonetic arrangement. 

Author accepting responsibility is
:___Ken Whistler__________________




Modifier Letters U+02B0-02FF 

Modifier Letters are an assorted collection of small signs that are used generally to indicate 
modifications of the preceding letter, although a few may modify the following letter, and some 
may serve as independent letters.  These signs are distinguished from "diacritical marks" in that 
modifier letters are treated as free-standing spacing characters.  They are distinguished from 
similar- or identical-appearing punctuation or symbols by the fact that the members of this 
block are considered to be letter characters that do not break up a word.  The majority of these 
signs are phonetic modifiers, including the requirements of the International Phonetic Alphabet 
(IPA).

Phonetic usage:  In phonetic usage these signs are sometimes called "diacritics", which is correct 
in the logical sense that they are modifiers of the preceding letter.  However, in Unicode, the 
term "diacritical marks" refers specifically to non-spacing applied marks, whereas the codes in 
the current block specify spacing characters.  For this reason, many of the "Modifier Letters" in 
this block correspond to separate "diacritical mark" codes which are cross-referenced in the 
names list.  Modifier letters have relatively well-defined phonetic interpretations.  Their usage is 
generally to indicate a specific articulatory modification of a sound represented by another 
letter, or to convey a particular level of stress or tone, etc.  The modifier letters in Unicode are 
collated from a variety of sources, the most important of which is the IPA.

Glyphic encoding:  Despite Unicode's general policy of encoding characters, not glyphs, 
Unicode takes a "glyphic" approach to encoding the Modifier Letters.  In this character set there 
exist different characters for the same "semantic", and there exist different "semantics" attributed 
to the same character in different contexts.  For example, the signs U+02BC, U+02BE, U+02C0 
have all been used in various publications as a Latin transliteration of the glottal stop (Arabic 
"hamza"), while at least U+02BC has other usages as well.  The intention of the Unicode 
encoding is not to resolve the variations in usage, but merely to supply implementors with a set 
of useful forms to choose from.  The list of usages given for each character should not be 
considered exhaustive.

Encoding structure:  The Unicode block for Modifier Letters is divided into the following 
relatively arbitrary ranges:
	U+02B0-02B8:  	Phonetic modifiers derived from Latin letters
	U+02B9-02D7: 	Miscellaneous phonetic modifiers
	U+02D8-02DB:  	Spacing clones of diacritics

Latin superscripts:  Graphically, some of the phonetic modifier signs are raised or 
superscripted, some are lowered or subscripted, and some are vertically centered.  The raised 
signs that derive from Latin letters might suggest the superscripting of the entire Latin alphabet, 
but the intention here is to encode only those few forms that have specific usage in IPA or other 
major phonetic systems.  Unicode does not in general provide separate codes for superscripted 
or subscripted characters (although exception is also made for a limited set of numeric forms to 
preserve one-to-one mapping with other prominent standards).

Spacing clones of diacritics:  Some corporate standards distinguish spacing and non-spacing 
forms of diacritical accent marks, and Unicode provides matching codes for these 
interpretations when practical.  The majority of the spacing forms are covered in Unicode block 
"Latin1" (derived from ISO 8859/1).  The four common European diacritics which do not have 
encodings in ISO 8859/1 are added as spacing characters in the current block.  Since the 
encoding in this block is glyphic, these forms may be used with any suitable interpretation (e.g. 
U+02D9 SPACING DOT ABOVE as an indicator of Mandarin Chinese fifth tone). 

Author accepting responsibility is
:______Ken Whistler_______________




Generic Diacritical Marks U+0300-036F 

The application of "Diacritical Marks" constitutes the fundamental extension mechanism for the 
Greek family of scripts (preeminently Latin, Cyrillic, and Greek).  The diacritical marks in this 
block are intended for generic use with any of these scripts, or even more generically, with any 
script if desired.  In addition to the marks in this block, other diacritics specific to some 
particular script are encoded along with the alphabet for that script.  Another block of diacritical 
marks, primarily used with symbols, is defined in code range U+20D0-20FF.  The allocation of a 
diacritic to one block or another is merely a matter of perceived appropriateness; it is not 
intended to define or limit the range of characters to which a particular mark may be applied.

Semantics of the "Diacritic" character property:  The annotation of a Unicode character as a 
"Diacritic" (or its occurrence in the present block), and its depiction with relation to a dashed 
circle, constitute an assertion that this character is intended to be applied via some process to an 
associated character called the "base character" or "baseform".  When rendered, the diacritical 
mark characters are intended to be attached to the preceding base character in some manner, 
and not to occupy a spacing position by themselves.  These marks may therefore be called "non-
spacing" or "floating" marks.

Marks as spacing characters:  By convention, Unicode diacritical marks may be exhibited in 
(apparent) isolation by applying them to the SPACE character U+0020.  Also, Unicode 
separately encodes clones of most diacritical marks that are spacing characters, largely to 
provide compatibility with existing character sets.  These related characters are cross-
referenced.
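
The relation between a spacing clone and "mark applied to SPACE" can be illustrated as follows.  The sketch assumes the compatibility decompositions of the published standard, as exposed by the unicodedata module of a present-day Python; these are not defined in this draft.

```python
import unicodedata

# Spacing clones from the Latin1 and Modifier Letters blocks (illustrative).
SPACING_CLONES = ["\u00A8", "\u00B4", "\u00B8", "\u02D9"]

for clone in SPACING_CLONES:
    # NFKD expands each clone into SPACE followed by a non-spacing mark.
    expanded = unicodedata.normalize("NFKD", clone)
    codes = " ".join("U+%04X" % ord(ch) for ch in expanded)
    print("U+%04X -> %s" % (ord(clone), codes))
```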

Sequence order of base character and diacritical mark:  In Unicode, all diacritical marks 
are intended to be encoded in sequence ***after*** the base characters.  Please note that 
this convention is different from the convention in standard ISO 6937 and other old 
standards.  The Unicode sequence U+0061 "a", U+0308 "¨", U+0075 "u" unambiguously 
encodes "äu", not "aü".  The reason for the old convention was conformity with "dead keys" 
on mechanical typewriters, which is no longer a consideration for computers.  The 
reason for the Unicode convention is consistency with the logical order of vowel 
"points" in Semitic and Indic scripts. In those scripts, diacritics logically follow their base 
characters.
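
The mark-follows-base convention can be checked with a short sketch.  It assumes the composition rules of the published standard (normalization form NFC in a present-day Python), which postdate this draft; the variable names are illustrative.

```python
import unicodedata

# "a", non-spacing diaeresis, "u": the mark belongs to the preceding "a".
sequence = "\u0061\u0308\u0075"
composed = unicodedata.normalize("NFC", sequence)

print(["U+%04X" % ord(ch) for ch in composed])
print("mark attached to the a:", composed == "\u00E4u")
print("mark attached to the u:", composed == "a\u00FC")
```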

Sequence order of multiple diacritical marks:  In case of multiple diacritical marks applied to 
the same base character, if the result is unambiguous there is no reason to specify a sequence 
order for the mark characters.  In the relatively rare cases where a standard sequence order of 
multiple marks is necessary, that order should be left-to-right, inside-outward.

Double diacritics:  A few marks are depicted with two dashed circles; such marks apply to the 
two characters preceding them in the text stream.

Spelling of marked combinations:  Since Unicode contains codes U+0075 "u", U+0308 
"DIAERESIS", and also U+00FC "U WITH DIAERESIS", there are potentially two distinct sequences 
that both spell the letter "ü" U WITH DIAERESIS.  The same problem exists for several dozen other 
Latin baseform-diacritic combinations.  Unicode recognizes that it is futile to prohibit the 
formation or transmission of any sequence of characters.  The only workable solution is to 
require that any system or application desiring to enforce a standard spelling convention filter 
its own input stream.  Since only a relatively small number of marked combination letters have 
independent Unicodes (for backward compatibility, see introduction to Latin1), requiring such 
filters should not pose major problems.
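
A filter of the kind suggested above might look like the following sketch.  The function name enforce_spelling() and its parameter are hypothetical, and the sketch leans on the normalization forms of the published standard (NFD and NFC, via Python's unicodedata module), which are an assumption here since they are not part of this draft.

```python
import unicodedata


def enforce_spelling(text, composed=False):
    """Rewrite text so each marked letter has one spelling.

    composed=False spells marks as separate codes after the baseform;
    composed=True prefers the single composed code where one exists.
    """
    form = "NFC" if composed else "NFD"
    return unicodedata.normalize(form, text)


single_code = "\u00FC"       # U WITH DIAERESIS as one code
two_codes = "\u0075\u0308"   # "u" followed by the non-spacing diaeresis

print(single_code == two_codes)
print(enforce_spelling(single_code) == enforce_spelling(two_codes))
```

A system that applies one such filter to its whole input stream sees only one of the two spellings, whichever convention it has chosen.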

Standards:  The handling of diacritical marks is currently a hotly-debated issue among different 
standards groups, since every potential solution has a high cost to some portion of the 
computing community.  Diacritical marks are treated with a great deal of inconsistency among 
current standards and even within some standards.  The Unicode solution recognizes the 
fundamental necessity of "floating" diacritics, and for consistency encourages the treatment of 
all diacritics as floating. At the same time it provides for compatibility mappings with the major 
standards that have adopted other solutions.  It should be repeated, however, that the Unicode 
sequence order of base-character-preceding-diacritical-mark differs from the convention in 
ISO 6937 and others, in order to reduce ambiguity through greater consistency.

Glyphic encoding:  Because the generic diacritical marks have such a wide variety of 
applications, the encoding in this block is intentionally "glyphic" rather than "semantic".  Thus, 
there are cases of several different semantics for the same Unicode, e.g. U+0308 = "diaeresis" = 
"umlaut" = "double derivative".  And there are cases of several different Unicodes for the same 
semantic, e.g., variants of "cedilla" include at least U+0312, U+0326, and U+0327.  Some 
diacritical marks are applied across the body of the base character; Unicode is more liberal 
about assigning independent codes to combination letters involving these marks since it is less 
obvious that they are separable from the basic structure of the letter.

Encoding structure:  The Unicode block for generic diacritical marks is divided into the 
following ranges:

	U+0300-0332:  	Ordinary diacritics
	U+0333-0337:  	Overstruck diacritics
	U+0338-033C:  	Double diacritics
	U+033D-036F:  	Currently unassigned 

Author accepting responsibility is
:_____Ken Whistler________________




Greek U+0370-03FF 

The Greek script is used for writing the Greek language, and (in an extended variant) for the 
Coptic language.  Greek is ancestral to the family of scripts including Latin and Cyrillic.  In this 
family, the main peculiarity is the occasional use of diacritical marks.

Standards:  The ECMA registry under ISO 2375 for use with ISO 2022 contains many Greek 
subsets.  Unicode is based on the latest and most prominent of these: ISO 8859/7, which equals 
the Greek national standard ELOT 928, and also ECMA-118.

ISO 8859/7:  Unicode encodes Greek characters in the same relative positions as in 
8859/7.  Generic punctuation characters (17 of them) are unified with characters in other 
Unicode ranges; cross-references to such codes are given in italics below.

2nd DP ISO 10646:  For the basic Greek set, 2nd DP ISO  10646 follows the 
arrangement of 8859/7, but it replaces many of the generic punctuation 
characters with various diacritics and combinations.  Of these, only U+0370 
"GREEK IOTA BELOW" is retained in the Greek section of Unicode; the others may 
be spelled with other Unicodes.  2nd DP ISO 10646 also contains dozens of 
baseform-diacritic combinations, which in Unicode are sequences, not single 
characters.

ISO 5428-1980:  A number of variant and archaic characters are taken into Unicode from 
this bibliographic standard.

Diacritical marks:  In Unicode, diacritical marks are spelled as separate characters occurring 
after the baseform character in text sequence.  In general, Unicode regards baseform-diacritic 
combinations as sequences represented via composition, which do not receive separate codes.  
However, the baseform-diacritic combinations that are in 8859/7 are retained for compatibility.

Several diacritical marks may be used with Greek that are not included in 8859/7.  These are 
found in the Generic Diacritical Marks range:

	U+0300, 0301, 0303, 0304
	U+0306, 0308, 0313, 0314

Since the marks in this range are encoded by shape, not by meaning, they are appropriate for 
use in Greek where applicable.  Multiple diacritical marks applied onto the same baseform 
character are to be spelled as the baseform character followed by the several mark characters in 
sequence.  The order of diacritic characters is from the base form outward.
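
A baseform followed by several marks can be sketched as below.  The example assumes the published standard as implemented in a present-day Python, including a composed Greek letter that is not part of this draft; it is meant only to show a baseform followed by two marks being treated as one letter.

```python
import unicodedata

# Small alpha, then two marks from the Generic Diacritical Marks range.
BASE = "\u03B1"
MARKS = ["\u0313", "\u0301"]

spelled = BASE + "".join(MARKS)
composed = unicodedata.normalize("NFC", spelled)

print("codes in sequence:", ["U+%04X" % ord(ch) for ch in spelled])
print("composed form:", ["U+%04X" % ord(ch) for ch in composed])
print(unicodedata.name(composed[0]))
```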

Encoding structure:  The Unicode block for the Greek script is divided into the following 
ranges:

	U+0370-03CF:  	Mapping of the standard 8859/7
	U+03D0-03D6:  	Variant letterforms
	U+03D7-03D9:  	Punctuation-like characters
	U+03DA-03E1:  	Archaic letters
	U+03E2-03EF:  	Coptic-unique letters
	U+03F0-03FF:  	Currently unassigned
	
Variant letterforms:  Variant forms of Greek letters (sigma and beta) are encoded as separate 
characters in ISO 8859/7 and ISO 5428-1980; therefore this approach is taken in the Unicode set.

Greek letters as symbols:  A few of the Greek variants that are used primarily as technical 
symbols are placed in this range since they are clearly forms of Greek letters.  In some cases, 
however, Greek letters borrowed into symbol usage may be said to have acquired separate 
identities, e.g. U+2126 "Ω" OHM SIGN vs. U+03A9 "Ω" GREEK CAPITAL LETTER OMEGA, or U+00B5 "µ" 
MICRO SIGN vs. U+03BC "μ" GREEK SMALL LETTER MU.  Despite identical glyphs, the semantic 
distinctions are so great that these characters are assigned separate codes which are cross-
referenced to distinguish them. 
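
The separate identities of the symbol and letter codes can be seen in a short sketch.  It assumes the character names and compatibility mappings of the published standard, as exposed by a present-day Python; the list SYMBOL_LETTER_PAIRS is a hypothetical name.

```python
import unicodedata

# (symbol code, letter code) pairs mentioned in the text.
SYMBOL_LETTER_PAIRS = [("\u2126", "\u03A9"), ("\u00B5", "\u03BC")]

for symbol, letter in SYMBOL_LETTER_PAIRS:
    print("U+%04X %s" % (ord(symbol), unicodedata.name(symbol)))
    print("U+%04X %s" % (ord(letter), unicodedata.name(letter)))
    # The codes differ, but a compatibility fold maps symbol to letter.
    folded = unicodedata.normalize("NFKC", symbol)
    print("    distinct codes: %s, folds to letter: %s" % (
        symbol != letter, folded == letter))
```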

Punctuation-like characters:  The question of which punctuation-like characters are "uniquely 
Greek" and which ones can be unified with generic Western punctuation has no definitive 
answer.  The Greek question mark U+03D7 ";" was retained for use by systems which treat it as 
a sentence-final punctuation in distinction from the semicolon.

Archaic letters:  Archaic letters have been retained from ISO 5428-1980, since there are only a 
few of them.  Their lower-case forms also occur in 2nd DP ISO 10646.

Coptic-unique letters:  The Coptic script is regarded as a font/style variant of the Greek 
alphabet.  The letters unique to Coptic have been added, since there are only a few of them.  
Their lower-case forms (except one) also occur in 2nd DP ISO 10646.  A complete Coptic set 
would be obtained by rendering the whole Greek alphabet in that same style. 

Author accepting responsibility is
:____Joe Becker_________________



Cyrillic U+0400-048F

The Cyrillic script is a member of the Greek family of scripts.  Cyrillic has traditionally been used 
for writing various Slavic languages, among which Russian is now predominant.  In recent 
years, Cyrillic has been extended for representing non-Slavic minority languages of the Soviet 
Union.

The Cyrillic script is well-behaved, its main peculiarity being the occasional use of diacritical 
marks.  Cyrillic letters come in uppercase/lowercase pairs.

Standards:  The ECMA registry under ISO 2375 for use with ISO 2022 contains several Cyrillic 
subsets.  Unicode is based on the latest and most prominent of these: ISO 8859/5.  The old 
Soviet standard for Russian only, GOST 13052-67, appears to be giving way to ISO 8859/5.

GOST 13052-67:  ("GOST" stands for "Government Standard".)  The old Soviet standard 
fails to encode even the full Russian alphabet (omitting # and #).  The Russian letters it 
does contain are encoded in order of their ASCII phonetic counterparts, not in the order 
of the Russian alphabet (presumably to enable automatic approximate transliteration).  
This approach is so counter-intuitive that no other standard follows it for the 
Russian alphabet.

ISO 8859/5:  Unicode encodes Cyrillic characters in the same relative positions as in 
8859/5.  Generic punctuation characters (4 of them) are unified with characters in other 
Unicode ranges; cross-references to such codes are given in italics below.
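
The "same relative positions" rule can be sketched as simple arithmetic.  The sketch below is only an illustration: the helper name cyrillic_from_8859_5 is hypothetical, and the exception table is taken from Python's built-in iso8859_5 codec (the published mapping), which is assumed here to agree with this draft.

```python
# Sketch: an 8859/5 byte in 0xA0-0xFF maps to U+0400 plus its offset from 0xA0,
# except for the four generic characters unified with other Unicode ranges.
# Assumption: the exceptions below come from Python's iso8859_5 codec.
UNIFIED = {
    0xA0: 0x00A0,  # no-break space
    0xAD: 0x00AD,  # soft hyphen
    0xF0: 0x2116,  # numero sign
    0xFD: 0x00A7,  # section sign
}

def cyrillic_from_8859_5(byte):
    """Return the code point for one byte from the upper half of 8859/5."""
    if not 0xA0 <= byte <= 0xFF:
        raise ValueError("expected a byte in the range 0xA0-0xFF")
    return UNIFIED.get(byte, 0x0400 + (byte - 0xA0))

# Cross-check every upper-half byte against the codec shipped with Python.
for b in range(0xA0, 0x100):
    assert cyrillic_from_8859_5(b) == ord(bytes([b]).decode("iso8859_5"))
```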

2nd DP ISO 10646:  For the basic Cyrillic set, 10646 follows the arrangement of 8859/5.  
But 10646 also contains dozens of baseform-diacritic combinations, which in Unicode 
are represented by character sequences, not single characters.

Diacritical marks:  In the Unicode design, diacritical marks are spelled as separate characters 
occurring after the baseform character in text sequence.  In general, Unicode regards baseform-
diacritic combinations as sequences represented via composition, which do not receive separate 
codes.  However, all of the baseform-diacritic combinations in 8859/5 are retained for 
compatibility.
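
As a small illustration of a baseform followed by its mark, the sketch below decomposes one precomposed 8859/5 letter with Python's unicodedata module.  It relies on the published Unicode character data (the code U+0306 for the breve is an assumption relative to this draft, whose mark codes may differ).

```python
import unicodedata

# Cyrillic short i: a baseform-diacritic combination retained from 8859/5.
precomposed = "\u0439"

# Canonical decomposition yields the baseform followed by the non-spacing mark.
sequence = unicodedata.normalize("NFD", precomposed)
codes = [f"U+{ord(ch):04X}" for ch in sequence]

# Composition goes the other way and recovers the single retained code.
recomposed = unicodedata.normalize("NFC", sequence)
```

Here codes holds the baseform U+0438 followed by the mark U+0306, and recomposed equals the original single character.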

Furthermore, letterforms that might be considered as baseform-diacritic combinations but 
where the mark appears integral to the body of the letter are encoded as independent characters, 
in order to avoid dispute over whether these letters have marks or protrusions.  The majority of 
the Extended Cyrillic characters fall into this category.  Also, a few idiosyncratic combinations 
used in archaic Cyrillic are encoded whole because the diacritics are not productive.

The only inseparable diacritical marks unique to Cyrillic are for Extended Cyrillic, and these are 
subject to wide typographic variability.  In particular, there is a generic protrusion of the lower-
right corner of a letter, which apparently originated as a generalization of the addition to 
U+0448 "sha #" that produces U+0449 "shcha #".  This entity appears in many different graphic 
renditions; the ISO standard character names refer to it erroneously as "CEDILLA".

Unifications:  The "Cyrillic" block of Unicode contains letters of various origin, most of them 
clearly from Greek, a few from Hebrew (U+0448 "sha #" from U+05E9 "shin #"), and some 
misleading (U+0455 Old Cyrillic zelo "S" not obviously from U+0073 Latin "S").  To avoid 
unnecessary chaos, Unicode regards all these letters as having established separate Cyrillic 
identities for themselves over the many centuries.  In contrast, the recently-created alphabets 
including "Extended Cyrillic" characters for Soviet minority languages are very far from well-
established.  Latin characters included in those alphabets (e.g. "q" and "w" for Kurdish, or 
U+0292 # "yogh" for Abkhasian) are not given unique Cyrillic encodings.

Languages:  The language(s) using a given character are noted in cases where this information 
was thought to be helpful (such annotation is given only after the lowercase form, to avoid 
needless repetition).  If such an annotation ends with an ellipsis "...", then the language(s) cited 
are merely the principal one(s) among many.  If the annotation does not end with an ellipsis, then 
the cited list is thought to be complete.

Glagolitic:  Glagolitic is a script originally related to Cyrillic, but the history of the creation of 
the scripts and their relationship has been lost.  Unicode regards Glagolitic as a separate script 
from Cyrillic, not as a font change from Cyrillic.  This is primarily because Glagolitic appears 
unrecognizably different from Cyrillic, and secondarily because Glagolitic has not grown to 
match the expansion of Cyrillic.  Since Glagolitic is essentially extinct, it is not encoded in the 
current draft of Unicode, but is expected to be in the future.

Encoding structure:  The Unicodes for the Cyrillic script are divided into two adjacent blocks 
"Cyrillic" and "Extended Cyrillic", which have the following ranges:
	U+0400-045F:  	Mapping of the standard 8859/5
	U+0460-0481:  	Archaic letters
	U+0482-048F:  	Archaic miscellaneous
	U+0490-04C0:  	Extended Cyrillic
	U+04C1-04FF:  	Currently unassigned
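
The range list above can be turned into a lookup table.  The following is a minimal sketch, using the ranges exactly as given in this draft; the names CYRILLIC_RANGES and classify are hypothetical.

```python
from bisect import bisect_right

# (first code, last code, description), copied from the draft's range list.
CYRILLIC_RANGES = [
    (0x0400, 0x045F, "Mapping of the standard 8859/5"),
    (0x0460, 0x0481, "Archaic letters"),
    (0x0482, 0x048F, "Archaic miscellaneous"),
    (0x0490, 0x04C0, "Extended Cyrillic"),
    (0x04C1, 0x04FF, "Currently unassigned"),
]
STARTS = [first for first, _, _ in CYRILLIC_RANGES]

def classify(code):
    """Return the description of the range containing code, or None."""
    i = bisect_right(STARTS, code) - 1   # last range starting at or before code
    if i < 0:
        return None
    first, last, label = CYRILLIC_RANGES[i]
    return label if code <= last else None
```

For example, classify(0x0406) returns the 8859/5 mapping range, while a code outside U+0400-04FF returns None.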

Archaic letters:  The archaic form of the Cyrillic alphabet is regarded as a font change from 
modern Cyrillic, because the archaic forms are relatively close to the modern appearance and 
because some of them are still in modern use in languages other than Russian (e.g., U+0406 Old 
Cyrillic "I" used in modern Ukrainian and Byelorussian).  Since the archaic letters outside of 
8859/5, i.e. those in columns U+046 through U+048, rarely occur in modern form, those letters 
are shown in the charts in an archaic font.  A complete Old Cyrillic set would be obtained by 
rendering the whole "Cyrillic" section, i.e., columns U+040 through U+048, in that same style.


Extended Cyrillic U+0490-04FF

Extended Cyrillic:  These are the baseforms used in alphabets for minority languages of the 
Soviet Union.  The order of these letters follows 2nd DP ISO 10646 and is based (very crudely) 
on graphic similarity to Russian letters, not on phonetic values.  Note that the scripts of some 
Soviet minority languages have often been revised in the past; Unicode includes only the 
alphabets in current use, not the rejected old letterforms. 

Author accepting responsibility is
:_____________________



Georgian U+0500-052F 

The Georgian script is used primarily for writing the Georgian language.  The script is very 
well-behaved, lacking even diacritical marks and uppercase/lowercase pairs.

Archaic script form:  The modern Georgian script is a style called MKHEDRULI (soldier's), which 
originated as the secular derivative of a form called KHUTSURI (ecclesiastical) that did have 
uppercase/lowercase pairs.  Since KHUTSURI is essentially extinct, it is not encoded in the current 
draft of Unicode, but it may be in the future.

Standards:  2nd DP ISO 10646:  Unicode departs from the 10646 arrangement for Georgian.  In 
Unicode, the archaic letters are placed together in a group after the modern letters.  In 10646, 
these two groups of letters are sorted together (in an order that is open to question).

Encoding structure:  The Unicode block for the Georgian script is divided into the following 
ranges:

	U+0500-0520:  	Modern alphabet
	U+0521-0526:  	Archaic letters
	U+0527-052A:	Currently unassigned
	U+052B:  	Punctuation
	U+052C-052F:  	Currently unassigned 

Author accepting responsibility is
:_____________________



Armenian U+0530-058F 

The Armenian script is used primarily for writing the Armenian language.  The script is very 
well-behaved, lacking even diacritical marks (although see below).  It does have 
uppercase/lowercase pairs.

Standards:  2nd DP ISO 10646:  Unicode follows the 10646 arrangement for Armenian.  Based 
on general policies, Unicode omits two digraphs and a ligature found in 10646.  The character 
that 10646 encodes as GEORGIAN FULL STOP is encoded in Unicode as ARMENIAN FULL STOP, since 
its modern usage is more common in Armenian than in Georgian.

Encoding structure:  The Unicode block for the Armenian script is divided into the following 
ranges:

	U+0530:	Currently unassigned
	U+0531-0556:  	Uppercase letters
	U+0557-0558:	Currently unassigned
	U+0559-055F:  	Modifier letters
	U+0560:	Currently unassigned
	U+0561-0586:  	Lowercase letters
	U+0587-0588:	Currently unassigned
	U+0589:  	Punctuation
	U+058A-058F:  	Currently unassigned
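
One consequence of this layout is that the uppercase and lowercase ranges are the same length and sit a fixed distance apart, so case conversion reduces to adding a constant.  A minimal sketch, assuming the ranges listed above (the function name armenian_to_lower is hypothetical):

```python
UPPER_FIRST, UPPER_LAST = 0x0531, 0x0556
LOWER_FIRST = 0x0561
CASE_OFFSET = LOWER_FIRST - UPPER_FIRST   # 0x30

def armenian_to_lower(ch):
    """Map an Armenian uppercase letter to lowercase; pass others through."""
    code = ord(ch)
    if UPPER_FIRST <= code <= UPPER_LAST:
        return chr(code + CASE_OFFSET)
    return ch
```

Characters outside the uppercase range, including the modifier letters and punctuation, are returned unchanged.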

Modifier letters:  The small marks in the group called Armenian modifier letters are sometimes 
said to be placed "above" the alphabetic letters of the words to which they apply, but in modern 
Armenian typography they are quite uniformly placed above and to the right, so that they actually 
occupy a letter position of their own.  Therefore, in Unicode these objects are treated as spacing 
letters rather than as non-spacing diacritical marks. 

Author accepting responsibility is
:_____Joe Becker________________




Hebrew U+0590-05FF

The Hebrew script is used for writing the Hebrew language, and also Yiddish and Ladino.  
Vowels and various other marks are written as "points" applied to consonantal base letters; in 
normal writing these points are omitted.  The script is written from right to left (the only other 
right-to-left script currently encoded in Unicode is Arabic).

Final (contextual variant) letterforms:  Variant forms of five Hebrew letters are encoded as 
separate characters in all Hebrew standards; this practice is therefore followed in the Unicode 
standard.

Right-to-left directionality:  The means of indicating right-to-left text directionality is still a 
hotly-debated topic (see separate discussion), but this debate has little effect on the selection and 
designation of the characters themselves.  In fact, there appears to be widespread agreement on 
the only substantive encoding correlate of directionality:  The punctuation marks used with the 
Hebrew script are not given independent codes (i.e., are unified with Latin punctuation), except 
for the few marks that are unique to Hebrew.

Standards:

ISO 8859/8:  Unicode encodes the Hebrew alphabetic characters in the same relative 
positions as in 8859/8; however, there are no points or Hebrew punctuation characters 
in this standard.

2nd DP ISO 10646:  Unicode follows the basic arrangement of 10646, as modified by the 
comments on 10646 supplied by the Standards Institution of Israel.

Encoding structure:  The Unicode block for the Hebrew script is divided into the following 
ranges:

	U+0590-05AF:  	Cantillation marks, accents
	U+05B0-05CF:  	Points and punctuation
	U+05D0-05EF:  	Mapping of ISO 8859/8
	U+05F0-05F2:  	Yiddish digraphs
	U+05F3-05F4:  	Additional punctuation
	U+05F5:	Additional point
	U+05F6-05FF:  	Currently unassigned

Points and cantillation accents:  These marks, generically called "points", indicate vowels or 
other modifications of consonant letters.  The occurrence of a character in the "Cantillation 
accents" or "Points and punctuation" range, depicted with relation to a dashed circle, constitutes 
an assertion that this character is intended to be applied via some process to the character that 
precedes it in the text stream, this being called the "base character".  These marks may therefore 
be called "non-spacing" or "floating" or "flying".  When rendered, these characters are intended 
not to occupy a spacing position by themselves.  By convention, such marks may be exhibited in 
(apparent) isolation by applying them to the SPACE character U+0020.  Unicode does not 
specify a sequence order in case of multiple marks applied to the same base character, since 
there is no possible ambiguity of interpretation.
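
Because each point follows its base character in the text stream, removing the points (as in normal unpointed writing) is a simple filter.  The sketch below assumes the published Unicode character data available through Python's unicodedata module, where such marks have the general category "Mn"; the function name strip_points is hypothetical.

```python
import unicodedata

def strip_points(text):
    """Drop non-spacing marks (category Mn), leaving only base characters."""
    return "".join(ch for ch in text if unicodedata.category(ch) != "Mn")

# shin + shin dot + qamats, lamed, vav + holam, final mem
pointed = "\u05E9\u05C1\u05B8\u05DC\u05D5\u05B9\u05DD"
unpointed = strip_points(pointed)

# A mark shown in (apparent) isolation: applied to SPACE, per the convention above.
isolated_qamats = "\u0020\u05B8"
```

The result keeps the four consonant letters in their original order.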

Cantillation accents:  These marks are used to indicate chanting of sacred texts.  There are 
several systems of such accents; current standards encode the Tiberian system.  The literature 
contains great variability in the relationship between the names of these accents and their 
graphic forms.

Points and punctuation:  A few of these marks are placed "after" (to the left of) their base 
characters.  In these cases Unicode treats them as ordinary spacing characters. 

Author accepting responsibility is
:_____Joe Becker________________



Arabic/Extended Arabic U+0600-06FF

The Arabic script is used for writing the Arabic language, and has been extended for 
representing a number of other languages both major and minor: Persian, Urdu, Pashto, Sindhi, 
Kurdish, etc.  Some languages which formerly used the Arabic script now employ the Latin or 
Cyrillic scripts: Indonesian/Malay, Turkish, Ingush, etc.

The Arabic script is cursive even in its printed form, so that as in the handwritten tradition, the 
same letter may be written in many different forms depending on how it joins with its 
neighbors.  Vowels and various other marks are written as "points" applied to consonantal base 
letters; in normal writing these points are omitted.  The script is written from right to left (the 
only other right-to-left script currently encoded in Unicode is Hebrew).

Semantic encoding:  The basic Arabic alphabet is relatively well-defined (at least, the basic 
consonants), and each letter receives only one Unicode value, no matter how many different 
contextual appearances it may exhibit in text.  Each Unicode may be said to represent the 
abstract character itself, or the inherent semantic identity of the letter.  A word is spelled as a 
sequence of abstract letters, i.e. as a sequence of Unicodes.  The task of converting such a 
spelling to a visual form, and the graphic fragments used to compose such a visual form, are 
matters external to character encoding.  The graphic form shown in the Unicode chart for an 
Arabic letter (usually the form of the letter when standing by itself) is not the identity of that 
Unicode, but rather a mere reminder of the abstract letter it represents.
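
To make the separation between spelling and visual form concrete, the following is a rough sketch of how a renderer might pick a contextual form for each abstract letter.  It is not part of the encoding: the joining table covers only a handful of letters and is purely illustrative, the function names are hypothetical, and the non-joiner is assumed to be the U+200C character described under General Punctuation.

```python
# Tiny, hypothetical joining table; a real one would cover every letter.
DUAL_JOINING = set("\u0628\u0634\u0647")         # beh, sheen, heh
RIGHT_JOINING = set("\u0627\u062F\u0631\u0648")  # alef, dal, reh, waw
NON_JOINER = "\u200C"

def joins_forward(ch):
    """True if ch can connect to the letter that follows it in the text."""
    return ch in DUAL_JOINING

def joins_backward(ch):
    """True if ch can connect to the letter that precedes it in the text."""
    return ch in DUAL_JOINING or ch in RIGHT_JOINING

def contextual_forms(word):
    """Return one form name per known letter; the stored codes never change."""
    forms = []
    for i, ch in enumerate(word):
        if not joins_backward(ch):
            continue  # non-joiner or a letter missing from the table: skipped
        prev_ch = word[i - 1] if i > 0 else ""
        next_ch = word[i + 1] if i + 1 < len(word) else ""
        linked_before = joins_forward(prev_ch)
        linked_after = joins_forward(ch) and joins_backward(next_ch)
        if linked_before and linked_after:
            forms.append("medial")
        elif linked_before:
            forms.append("final")
        elif linked_after:
            forms.append("initial")
        else:
            forms.append("isolated")
    return forms

joined = contextual_forms("\u0634\u0628\u0647\u0627")
split = contextual_forms("\u0634\u0628" + NON_JOINER + "\u0647\u0627")
```

The same four letter codes yield different visual forms in the two cases; only the presence of the non-joiner in the text differs.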

Right-to-left directionality:  The means of indicating right-to-left text directionality is still a 
hotly-debated topic (see separate discussion), but this debate has little effect on the selection and 
designation of the characters themselves.  In fact, there appears to be widespread agreement on 
the only substantive encoding correlate of directionality:  The punctuation marks used with the 
Arabic script are not given independent codes (i.e. are unified with Latin punctuation), except for 
the few cases where the mark has a significantly different appearance in Arabic, namely: 
U+060C # comma, U+061B # semicolon, U+061F # question mark, U+066A # percent sign.

Standards:

ISO 8859/6 = ECMA-114 = ASMO 449:  There is a relatively well-established standard encoding 
for Arabic; Unicode therefore places the basic Arabic characters in the same relative 
positions as in this standard.  This Arabic standard order is worth adhering to despite foibles such 
as the remarkable gap this leaves in the alphabet (U+063B-0640) and the omission of all 
"extended" Arabic letters needed for other languages in this family.

2nd DP ISO 10646:  10646 follows the arrangement of 8859/6 for the basic Arabic 
characters.  It also contains other Arabic forms scattered with no obvious logic into three 
different areas: extended Arabic letters, digits, and "presentation forms".  Unicode 
includes the "extended" letters and digits because they are needed for other languages in 
the family, but, as a rule, does not encode Arabic "presentation forms" because they are 
not characters.

Encoding structure:  Unicodes for Arabic scripts are divided into two adjacent blocks "Arabic" 
and "Extended Arabic", which have the following ranges:

	U+0600-064A:  	Basic Arabic characters as mapped in ISO 8859/6
	U+064B-065F:  	Points from 8859/6
	U+0660-066F:  	Extended Arabic: "Indic" digits
	U+0670:  	Extended Arabic: Additional point
	U+0671-06D4:  	Extended Arabic letters
	U+06D5-06FF:  	Currently unassigned

Points:  Points are marks that indicate vowels or other modifications of consonant letters.  The 
occurrence of a character in the "Points" range, and its depiction with relation to a dashed circle, 
constitute an assertion that this character is intended to be applied via some process to the 
character that precedes it in the text stream, this being called the "base character".  These marks 
may therefore be called "non-spacing" or "floating" or "flying".  When rendered, these characters 
are intended not to occupy a spacing position by themselves.  By convention, such marks may be 
exhibited in (apparent) isolation by applying them to the SPACE character U+0020.  Unicode does 
not specify a sequence order in case of multiple marks applied to the same Arabic base 
character, since there is no possible ambiguity of interpretation.

"Indic" digits:  The "Indic" digits are those used in conjunction with the Arabic script (the term 
"Indic" is used to avoid the ambiguity of the term "Arabic digits").  Unicode assigns separate 
codes to the digits of each script, just as it does to the letters of each script.  The Persian and 
Urdu variant digits are given separate codes under the principle of  "glyphic coding," discussed 
below.
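
Since each script's digits receive their own codes, software that reads numbers has to map digit characters to values rather than assume the ASCII digits.  A minimal sketch using the digit values in Python's unicodedata module (published Unicode data, assumed here to match this draft for U+0660-0669); the function name parse_number is hypothetical.

```python
import unicodedata

def parse_number(text):
    """Convert a string of decimal digits from any one script to an integer."""
    value = 0
    for ch in text:
        # unicodedata.digit raises ValueError for characters that are not digits.
        value = value * 10 + unicodedata.digit(ch)
    return value

indic = parse_number("\u0661\u0669\u0669\u0660")   # "Indic" digits one-nine-nine-zero
ascii_value = parse_number("1990")
```

Both calls produce the same integer, since only the digit values matter, not the script.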

Extended Arabic letters:  The set of letters encoded in this section unavoidably contains 
spurious forms.  The Arabic script has been extended for some relatively obscure languages (e.g. 
Baluchi, Lahnda) which have little tradition in printed typography.  Although the available 
information on variant handwritten forms is sporadic and inconsistent, it is clear that in many 
cases the extended letters for obscure languages overlap with the well-defined character 
extensions used by major languages like Persian (Farsi) and Pashto.

In this situation of imperfect information, Unicode adopts a "glyphic" approach to the baseform 
letters and variant digits in the Extended Arabic block.  There are often different characters for 
the same "semantic" (or sound), and different "semantics" (or sounds) attributed to the same 
characters by different languages.  The best we can do is to supply a superset of the various 
characters to choose from; codes that are not needed (and/or regarded as invalid) should 
simply be ignored.  Given imperfect information and the risk of omitting valid characters, this 
approach was felt to be the most practical.   Within this framework, however, the graphic form 
shown in the Unicode chart for an Extended Arabic letter remains merely the stand-alone form 
of the abstract letter, just as in the chart of the basic Arabic alphabet.

The names given to extended Arabic characters are entirely artificial, intended only to create 
unique identifiers.  The language(s) using a given character are indicated, even though this 
information is incomplete.  When such an annotation ends with an ellipsis "...", then the 
languages cited are merely the known principal ones among many. 

Plurals in Farsi

Subject: Re:  Arabic languages - Algorithmic shaping
Cc: fortran@ibm.com, khan@btc.kodak.com

	>>If this is the issue referred to, then the only problem is
	>>determining whether in Farsi the correct typography would be to
	>>separate off the plural suffix with a normal space in rendering or
	>>with a thinspace.  The contextual shaping of the individual glyphs
	>>is otherwise perfectly regular.  It is a separate issue to determine
	>>the correct and expected UI for entering Farsi which has this
	>>typographical behavior.

The solution suggested above for handling the plural suffix is an
acceptable solution. The plural suffix is often written after the word
with very little space in between the two. The use of a thin space to separate
the two would be okay, and should be preferred because it keeps the
context analysis algorithm perfectly regular. Furthermore, such separation 
between the plural suffix and the word is not a universal practice. 
There are many instances when the plural is joined with the rest of the word
using regular joining rules. An example would be the plural for the word
"shub" which means night, and whose plural is "shub-ha" which can be, and is
written both ways. Thus such usage is more correctly a typographic refinement
and should be user selectable through the appropriate UI with provisions for 
entry of various types of spacing elements.
The use of thin spaces is also needed because in some cases one wants to 
force isolated forms of the characters in a word, and this is the only 
way to do it correctly.
 



Author accepting responsibility is
:______Joe Becker_______________



Ethiopian U+0700-081F

The Ethiopian script is used for writing several languages of the area, including Amharic, Tigre, 
and Oromo.  The script, which is based on the writing of a dead language Ge'ez, is graphically 
well-behaved.  However, it is a syllabary rather than an alphabet, which has several encoding 
consequences discussed below.

Array structure:  The basic Ge'ez syllabary is traditionally arranged as an array of 33 consonant 
initials crossed with 7 vowel finals.  Since most of the consonants also take a labialized final, this 
can be expanded to a 33 x 8 array, which is ideal for encoding.  This orderly array forms the 
basis for the Unicode "Ethiopian" block; other characters are added afterward in a less 
systematic fashion.
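
The 33 x 8 array fills exactly 264 codes, which matches the size of the U+0700-0807 range given under "Encoding structure".  The sketch below shows the arithmetic such an array permits.  Note the assumption: the draft does not state the order within the array, so the row-by-row layout (one consonant per row of 8 finals) and the names used here are illustrative only.

```python
BASE = 0x0700
CONSONANTS, FINALS = 33, 8   # array dimensions given in the text

def syllable_code(consonant, final):
    """Code for the syllable at (consonant row, final column), both 0-based."""
    if not (0 <= consonant < CONSONANTS and 0 <= final < FINALS):
        raise ValueError("index outside the 33 x 8 array")
    return BASE + consonant * FINALS + final

def syllable_indices(code):
    """Inverse mapping: recover (consonant row, final column) from a code."""
    offset = code - BASE
    if not 0 <= offset < CONSONANTS * FINALS:
        raise ValueError("code outside the basic syllabary")
    return divmod(offset, FINALS)
```

Under this assumed layout the first cell is U+0700 and the last cell, row 32 column 7, is U+0807.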

Standards:  2nd DP ISO 10646:  The 10646 arrangement for Ethiopian is also derived originally 
from the 33 x 8 syllabic array, but in 10646 this array is destroyed by the impossibility of forcing 
it into a "graphic character set" structure of 94 codes.

Encoding structure:  The Unicodes for the Ethiopian script are divided into two adjacent blocks 
"Ethiopian" and "Extended Ethiopian", which have the following ranges:

	U+0700-0807:  	Basic Ge'ez syllabary
	U+0808-081B:  	Numbers
	U+081C-081F:  	Punctuation
	U+0820-082F:  	Variant letters
	U+0830-0832:  	Additional punctuation
	U+0833-083F:  	Diacritical marks
	U+0840-089E:  	Additional letters
	U+089F-08FF:  	Currently unassigned

Variant letters:  These are common but unsystematic variants of letters in the syllabic array.

Diacritical marks:  The Ethiopian syllabic letterforms in most cases reveal their origin as 
composites of a consonant base character plus a vowel diacritical mark, with labialization 
represented by a further diacritical mark.  In Unicode the syllabic letters are represented as 
whole codes, rather than by composition, because the composites have truly become the units of 
the script (and besides, the compositional rules are very irregular).  However, a syllabary is 
more difficult to extend than an alphabet, and there may be merit in accomplishing some 
extensions via the application of diacritical marks.  The few marks in this range appear to be the 
most productive in producing extensions, and are provided in case there is a desire to use them 
in this fashion.


Extended Ethiopian U+0820-08FF

Extended Ethiopian letters:  This group includes some extensions of the basic syllabary, plus a 
set of labialized series that is now part of the standard script (and which in some cases replicates 
syllables in the main array).  The characters are arranged according to the same N x 8 scheme as 
the main array.  The names given to the extended Ethiopian characters are somewhat artificial, 
intended mainly to create a unique identifier.

The Ethiopian script has been extended for some relatively obscure languages which may have 
little tradition of printed typography, and obsolete alternative forms of some letters also exist.  
The available information on variant letter forms is often sporadic and inconsistent, so some of 
the codes may be regarded as unneeded (and/or invalid) for some applications.  It is assumed 
that the encoding of various languages will make use of various different subsets of these 
extensions.

Given the imperfection of information and the bulkiness of extensions to a syllabary, the 
currently unassigned range has been made larger for Ethiopian than for other scripts (enough 
singly-attested forms have already been collected to fill it). 


Author accepting responsibility is
:_____________________




Devanagari U+0900-097F

Block introduction not yet written.
Author accepting responsibility is
:_____________________

Bengali U+0980-09FF

Block introduction not yet written.
Author accepting responsibility is
:_____________________

Gurmukhi U+0A00-0A7F

Block introduction not yet written.
Author accepting responsibility is
:_____________________

Gujarati U+0A80-0AFF

Block introduction not yet written.
Author accepting responsibility is
:_____________________

Oriya U+0B00-0B7F

Block introduction not yet written.
Author accepting responsibility is
:_____________________

Tamil U+0B80-0BFF

Block introduction not yet written.
Author accepting responsibility is
:_____________________

Telugu U+0C00-0C7F

Block introduction not yet written.
Author accepting responsibility is
:_____________________

Kannada U+0C80-0CFF

Block introduction not yet written.
Author accepting responsibility is
:_____________________

Malayalam U+0D00-0D7F

Block introduction not yet written.
Author accepting responsibility is
:_____________________

Sinhalese U+0D80-0DFF

Block introduction not yet written.
Author accepting responsibility is
:_____________________

Thai U+0E00-0E7F

Block introduction not yet written.
Author accepting responsibility is
:_____________________

Lao U+0E80-0EFF

Block introduction not yet written.
Author accepting responsibility is
:_____________________

Burmese U+0F00-0F7F

Block introduction not yet written.
Author accepting responsibility is
:_____________________

Khmer U+0F80-0FFF

Block introduction not yet written.
Author accepting responsibility is
:_____________________



Tibetan U+1000-107F

Block introduction not yet written.
Author accepting responsibility is
:_____________________

Mongolian U+1080-10FF
(to be defined)
Block introduction not yet written.
Author accepting responsibility is
:_____________________




General Punctuation U+2000-206F

General punctuation combines punctuation characters and character-like elements used to 
achieve certain text layout effects.  The former contain punctuation which can be used with 
many different scripts.  Many general punctuation characters can also be found in the Unicode 
ASCII and Latin1 blocks. Punctuation felt to belong to a specific script is found in the block 
corresponding to that script, e.g. the Greek question mark U+03D7 ";" or the punctuation used 
with ideographs in the CJK Symbols block.

For decimal points and thousands separators, several encodings were supplied to provide 
applications with the ability to encode these either glyphically or semantically depending on 
their processing needs. (Latest version revised this, but I don't have the details yet./ed)


!! NOTE: Standards mention is tentative

Encoding structure: The Unicode block for the general punctuation is divided into the following 
ranges:
	U+2000-200A:	Typographical space characters
	U+200B-200F: 	Zero width layout characters
	U+2010-203E: 	Printing punctuation characters 
	U+203F-206F: 	Currently unassigned

Typographical space characters: These are encoded glyphically and allow fine control over the 
width of the space character.

Zero width layout characters: Occasionally it is desirable to indicate to software formatting text 
that adjacent characters do or do not run together, or in the case of mixed left-to-right right-to-
left nested text runs to disambiguate the direction of characters that do not carry an intrinsic 
directionality.  For this purpose Unicode provides zero width layout characters.  The Zero width 
space U+200B acts just like any other space character, except that it has zero width.  The non-
joiner U+200C, if placed between e.g. "f" and "i", would prohibit the use of the "fi" ligature by the 
formatting software.   The joiner U+200D  has the opposite effect.  The left-to-right marker 
U+200E  and the right-to-left marker U+200F  can be used to override the formatting software's 
default decision about the directionality of a given character or text-run by providing a non-
printing character of a given directionality.

Except for their effect on the layout of the text in which they are contained these zero width 
layout characters can be treated just as any other character by the processing software; in 
particular they do not introduce a mode or state into the character sequence.  For non-layout 
text processing, such as sorting, searching etc. they can simply be filtered out.
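
Filtering these characters out before comparison might look like the sketch below.  It assumes the U+200B-200F range listed above; the names strip_layout and same_for_search are hypothetical.

```python
# The zero width layout characters, U+200B through U+200F.
ZERO_WIDTH_LAYOUT = {chr(code) for code in range(0x200B, 0x2010)}

def strip_layout(text):
    """Remove zero width layout characters; all other characters are kept."""
    return "".join(ch for ch in text if ch not in ZERO_WIDTH_LAYOUT)

def same_for_search(a, b):
    """Compare two strings while ignoring layout-only characters."""
    return strip_layout(a) == strip_layout(b)
```

With this filter, a search for "fi" also matches text in which a non-joiner was inserted to suppress the ligature.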

Author accepting responsibility is
:____Ken Whistler_________________




Superscripts and Subscripts U+2070-209F 

!! NOTE: Standards mention is tentative

Encoding structure: The Unicode block for superscripts and subscripts is divided into the 
following ranges:
	U+2070: 	Superscript 0
	U+2071-2073: 	Reserved
	U+2074-207F: 	Superscripts
	U+2080-208E: 	Subscripts
	U+208F-209F: 	Currently unassigned
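
A practical consequence of the reserved gap is that superscript digits cannot be produced by a single offset.  In the published standard, superscript 1, 2 and 3 are the Latin1 characters U+00B9, U+00B2 and U+00B3; the sketch below assumes this draft intends the same, and assumes subscript digits 0-9 occupy U+2080-2089.  The function names are hypothetical.

```python
# Superscripts 1-3 live in Latin1; 0 and 4-9 live in this block.
SUPERSCRIPT_DIGITS = {"0": "\u2070", "1": "\u00B9", "2": "\u00B2", "3": "\u00B3"}
SUPERSCRIPT_DIGITS.update({str(d): chr(0x2070 + d) for d in range(4, 10)})

# Subscript digits are assumed contiguous from U+2080.
SUBSCRIPT_DIGITS = {str(d): chr(0x2080 + d) for d in range(10)}

def to_superscript(digits):
    """Map a string of ASCII digits to superscript characters."""
    return "".join(SUPERSCRIPT_DIGITS[d] for d in digits)

def to_subscript(digits):
    """Map a string of ASCII digits to subscript characters."""
    return "".join(SUBSCRIPT_DIGITS[d] for d in digits)
```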

Author accepting responsibility is
:____Ken Whistler_________________



Currency U+20A0-20CF 

!! NOTE: Standards mention is tentative

This block contains currency symbols.  Other currency symbols are encoded in the ASCII and 
Latin1 blocks.


Encoding structure: The Unicode block for currency is divided into the following ranges:

	U+20A0-20A9: 	Currency Symbols
	U+20AA-20CF: 	Currently unassigned

Author accepting responsibility is
:_____________________



Diacritics U+20D0-20FF 

!! NOTE: Standards mention is tentative

Encoding structure: The Unicode block for diacritics is divided into the following ranges:
	U+20D0-20E1: 
	U+20E1-20FF: 	Currently unassigned

Author accepting responsibility is
:____Ken Whistler_________________




Letterlike Symbols U+2100-214F

!! NOTE: Standards mention is tentative

Encoding structure: The Unicode block for letterlike symbols is divided into the following 
ranges:
	U+2100-2129: 	Letterlike symbols
	U+212A-214F: 	Currently unassigned

Author accepting responsibility is
:____Ken Whistler_________________



Number Forms U+2150-218F 

!! NOTE: Standards mention is tentative

Encoding structure: The Unicode block for number forms is divided into the following 
ranges:
	U+2150-2152: 	Overstruck forms of digits
	U+2153-215F: 	Vulgar fractions
	U+2160-2182: 	Roman numerals and small roman numerals
	U+2183-218F: 	Currently unassigned

Author accepting responsibility is
:____Ken Whistler_________________



Arrows U+2190-21FF

!! NOTE: Standards mention is tentative

Glyphic encoding: Because the arrows have such a wide variety of applications, the encoding of 
this block is intentionally "glyphic" rather than "semantic". Thus there may be several semantics 
for the same Unicode, e.g., U+2185 " " downward left arrow = carriage return.  And there are 
several essentially stylistic variants for each of the basic arrow forms.

Encoding structure: The Unicode block for arrows is divided into the following ranges:
	U+2190-21EA: 	Arrows
	U+21EB-21FF: 	Currently unassigned

Author accepting responsibility is
:_____________________




Mathematical Operators U+2200-22FF

!! NOTE: Standards mention is tentative

Mathematical operators are also found in the ASCII and Latin1 blocks.  In addition, symbols 
from the miscellaneous technical block, and characters from general punctuation are also often 
used for mathematical notation. Mathematical operators such as "implies" and "if and only if" 
have been unified with the corresponding arrows in the arrows block (U+21D2, U+21D4).

Latin letters in special font styles, such as script P for the Weierstrass elliptic function U+2118, 
are to be found in the Letterlike Symbols block.  There are two Greek letters used for semantic 
units which are not part of the Greek block.  These are "micro" U+00B5 "µ" in the Latin1 block and 
the "Ohm sign" U+2126 "Ω" in the Letterlike Symbols block.  All other Greek characters with special 
mathematical semantics have been unified with the Greek characters in the Greek block because 
their mathematical semantics do not distinguish them substantially from Greek letters.

Glyphic encoding: Because mathematical operators have such a wide variety of applications, the 
encoding of this block is intentionally "glyphic" rather than "semantic". There may be several 
semantics for the same Unicode, e.g. U+2218 circle bullet = composite function = APL jot.  And 
there are several essentially stylistic variants for many operators, e.g., U+2208 = U+220B = 
U+228A;  all encode "is an element of." 

Encoding structure: The Unicode block for the mathematics operators is divided into the 
following ranges:
	U+2200-22C3: 	Mathematics operators
	U+22C4-22FF: 	Currently unassigned

Author accepting responsibility is
:____Asmus Freytag_________________




Miscellaneous Technical U+2300-23FF 

!! NOTE: Standards mention is tentative

Encoding structure: The Unicode block for miscellaneous technical symbols is divided into the 
following ranges:

	U+2300-2328: 	Miscellaneous technical symbols
	U+2329-23FF: 	Currently unassigned

Author accepting responsibility is
:____Asmus Freytag_________________





Control Pix U+2400-243F 

!! NOTE: Standards mention is tentative

Encoding structure: The Unicode block for the control code pix is divided into the following 
ranges:
	U+2400-241F: 	Pictorial representation for control codes U+0000-001F
	U+2420-2423: 	Pictorial representations for "Space" and "Delete"
	U+2424-243F: 	Currently unassigned
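
The first range above pairs each C0 control code with a picture at a fixed
offset from U+2400.  A minimal Python sketch of that pairing follows.  It
assumes the pictures are laid out in the same order as the control codes they
depict; the function name is illustrative only and is not part of the draft.

```python
CONTROL_PIX_BASE = 0x2400  # start of the Control Pix block, per the ranges above


def control_picture(code: int) -> int:
    """Return the Control Pix code for a C0 control code U+0000-001F.

    Assumption: picture N sits at U+2400 + N, mirroring the control code order.
    """
    if not 0x0000 <= code <= 0x001F:
        raise ValueError(f"U+{code:04X} is not a C0 control code")
    return CONTROL_PIX_BASE + code


# Carriage return is control code U+000D.
print(f"U+{control_picture(0x000D):04X}")  # U+240D
```

The pictures for "Space" and "Delete" are left out because the list above does
not say which code in U+2420-2423 belongs to which.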

Author accepting responsibility is
:_____________________


OCR U+2440-245F 

!! NOTE: Standards mention is tentative

Encoding structure: The Unicode block for OCR is divided into the following ranges:
	U+2440-244A: 	OCR Symbols
	U+244B-245F: 	Currently unassigned

Author accepting responsibility is
:_____________________




Enclosed Alphanumerics U+2460-24FF 

!! NOTE: Standards mention is tentative

Encoding structure: The Unicode block for enclosed alphanumerics is divided into the following 
ranges:
	U+2460-2473: 	Encircled numbers 1-20
	U+2474-2487: 	Parenthesized numbers 1-20
	U+2488-249B: 	Numbers with period 1-20
	U+249C-24B5: 	Parenthesized small Latin a-z
	U+24B6-24CF:	Encircled capital Latin A-Z
	U+24D0-24E9: 	Encircled small Latin a-z
	U+24EA-24FF: 	Currently unassigned
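
Each assigned range above holds a run of consecutive numbers or letters, so a
code can be computed from the first code of its range.  The Python sketch below
assumes the values appear in ascending order within each range (1 first, "a"
or "A" first); the range names and the function are illustrative only.

```python
# (first code, first value, last value) for each assigned range listed above.
# The keys are made-up labels, not names from the draft.
ENCLOSED_RANGES = {
    "circled_number":       (0x2460, 1, 20),
    "parenthesized_number": (0x2474, 1, 20),
    "number_with_period":   (0x2488, 1, 20),
    "parenthesized_small":  (0x249C, ord("a"), ord("z")),
    "circled_capital":      (0x24B6, ord("A"), ord("Z")),
    "circled_small":        (0x24D0, ord("a"), ord("z")),
}


def enclosed(kind: str, value) -> int:
    """Return the code for an enclosed number (int) or letter (one-char str)."""
    first_code, lo, hi = ENCLOSED_RANGES[kind]
    n = ord(value) if isinstance(value, str) else value
    if not lo <= n <= hi:
        raise ValueError(f"{value!r} is out of range for {kind}")
    return first_code + (n - lo)


print(f"U+{enclosed('circled_number', 20):04X}")   # U+2473
print(f"U+{enclosed('circled_small', 'z'):04X}")   # U+24E9
```

Both printed codes are the last code of their range, which checks that each
range has room for exactly 20 numbers or 26 letters.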


Author accepting responsibility is
:_____________________





Form and Chart Components U+2500-257F  
Forms

!! NOTE: Standards mention is tentative

Encoding structure: The Unicode block for forms is divided into the following ranges:
	U+2500-254F: 	Single line box and line drawing elements
	U+2550-256C: 	Line box drawing elements with double line segments
	U+256D-2574: 	Miscellaneous
	U+2575-257F: 	Currently unassigned

Blocks

!! NOTE: Standards mention is tentative

Encoding structure: The Unicode block for blocks is divided into the following ranges:
	U+2580-2593: 	Block and bar characters
	U+2594-259F: 	Currently unassigned

Geometric Shapes

!! NOTE: Standards mention is tentative

Encoding structure: The Unicode block for Geometric Shapes is divided into the following 
ranges:
	U+25A0-25E5: 	Geometric shapes
	U+25E6-25FF: 	Currently unassigned
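
The three range lists for Forms, Blocks and Geometric Shapes can be combined
into one lookup table.  The Python sketch below is one possible way to do that,
using a binary search over the range starts; the table is copied from the
lists above, while the function name and return shape are assumptions made for
illustration.

```python
from bisect import bisect_right

# (first code, last code, sub-block, assigned?) from the three lists above.
RANGES = [
    (0x2500, 0x2574, "Forms", True),
    (0x2575, 0x257F, "Forms", False),
    (0x2580, 0x2593, "Blocks", True),
    (0x2594, 0x259F, "Blocks", False),
    (0x25A0, 0x25E5, "Geometric Shapes", True),
    (0x25E6, 0x25FF, "Geometric Shapes", False),
]
STARTS = [first for first, _, _, _ in RANGES]


def classify(code: int):
    """Return (sub-block name, assigned flag), or None outside U+2500-25FF."""
    i = bisect_right(STARTS, code) - 1   # last range starting at or before code
    if i < 0 or code > RANGES[i][1]:
        return None
    _, _, name, assigned = RANGES[i]
    return name, assigned


print(classify(0x2550))  # ('Forms', True)
print(classify(0x2594))  # ('Blocks', False)
print(classify(0x2600))  # None
```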

Author accepting responsibility is
:_____________________




Basic Dingbats & Miscellaneous U+2600-26FF

Basic Dingbats & Miscellaneous

!! NOTE: Standards mention is tentative

Encoding structure: The Unicode block for Basic Dingbats and Miscellaneous is divided into the 
following ranges:
	U+2600-2674: 	Basic Dingbats and Miscellaneous
	U+2675-26FF: 	Currently unassigned

Author accepting responsibility is
:_____________________




Chinese/Japanese/Korean Non-ideographic Symbols U+3000-33FF

CJK Symbols and Punctuation U+3000-303F

Standards: Based on 2nd DP ISO 10646

Encoding structure: The Unicode block for CJK Symbols and Punctuation is divided into the 
following ranges:
	U+3000-3031: 	CJK Current Symbols and Punctuation
	U+3032-303F: 	Currently unassigned

Hiragana U+3040-309F

Hiragana is the cursive syllabary used to phonetically write Japanese words, sentence particles 
and inflectional endings.  Hiragana are also commonly used to indicate the pronunciation of 
Japanese words. Hiragana are phonetically equivalent to the corresponding Katakana syllables.  

Standards: The Unicode Hiragana block is based on the JIS X 0208-1983 standard, extended by 
the non-standard syllable U+3094 VU, which is included to accommodate a 1:1 mapping 
between Katakana and Hiragana syllables.

Encoding structure: The Unicode block for the Hiragana script is divided into the following 
ranges:
	U+3040-3093: 	Mapping of the JIS X 0208 standard
	U+3094: 	Variant form
	U+3095-309A: 	Currently unassigned
	U+309B-309C: 	Diacritical marks
	U+309D-309E: 	Punctuation-like characters
	U+309F: 	Currently unassigned

Diacritical marks: Hiragana and the related script Katakana use the two diacritics encoded in 
this block to generate voiced and semi-voiced syllables from the base syllables.  In the Unicode 
design, these diacritical marks follow the base character.

Punctuation-like characters: These are the Hiragana-specific iteration and voiced iteration 
marks. 

Katakana U+30A0-30FF 

Katakana is the syllabary used to phonetically write non-Japanese (usually Western) words. 
Katakana are also commonly used to write Japanese words in order to create visual 
emphasis. Katakana are phonetically equivalent to the corresponding Hiragana syllables.

Standards: The Unicode Katakana block is based on the JIS X 0208-1983 standard.

Encoding structure: The Unicode block for the Katakana script is divided into the following 
ranges:
	U+30A0-30F6: 	Mapping of the JIS X 0208 standard
	U+30F7-30FB: 	Currently unassigned
	U+30FC-30FE: 	Punctuation-like characters
	U+30FF: 	Currently unassigned

Punctuation-like characters: These are the Katakana conjunctive, the Hiragana/Katakana 
prolonged-syllable mark, the specific iteration and the voiced iteration marks. 
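
The 1:1 mapping between Hiragana and Katakana syllables mentioned above can be
sketched in Python as a constant offset between the two blocks.  This is an
assumption, not something the draft states outright: it holds only if
equivalent syllables occupy parallel positions in the blocks starting at U+3040
and U+30A0.  The function names are illustrative.

```python
HIRAGANA_FIRST, HIRAGANA_LAST = 0x3040, 0x3094   # JIS mapping plus U+3094 VU
KATAKANA_FIRST = 0x30A0

# Assumed: equivalent syllables sit at the same position in each block.
KANA_OFFSET = KATAKANA_FIRST - HIRAGANA_FIRST    # 0x60


def hiragana_to_katakana(code: int) -> int:
    """Map a Hiragana syllable code to the parallel Katakana code."""
    if not HIRAGANA_FIRST <= code <= HIRAGANA_LAST:
        raise ValueError(f"U+{code:04X} is not a Hiragana syllable")
    return code + KANA_OFFSET


def katakana_to_hiragana(code: int) -> int:
    """Map a Katakana syllable code back to the parallel Hiragana code."""
    if not KATAKANA_FIRST <= code <= HIRAGANA_LAST + KANA_OFFSET:
        raise ValueError(f"U+{code:04X} has no Hiragana counterpart")
    return code - KANA_OFFSET


print(f"U+{hiragana_to_katakana(0x3094):04X}")  # U+30F4
```

Under this assumption the Katakana codes U+30F5-30F6 have no Hiragana
counterpart, which agrees with U+3095-309A being listed as unassigned.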

Author accepting responsibility is
:_____Lee Collins_____



Zhuyinfuhao: Chinese Bopomofo Phonetic Symbols  U+3100-312F

Standards: Based on the GB2312-80, Big-5, and CNS Standards

Encoding structure: The Unicode block for Bopomofo is divided into the following ranges:
	U+3100-312A: 	Mapping of GB2312-80, CNS, and IBM Big-5 Bopomofo Sections
	U+312B-312F: 	Currently unassigned

Author accepting responsibility is
:____Jim Caldwell_________________



Hangul Elements: Basic Korean Phonetic Symbols U+3130-318F

Standards: Unicode follows KS C 5601-87 for Hangul elements. 

Encoding structure: The Unicode block for Hangul elements is divided into the following 
ranges:
	U+3130-3163: 	Mapping of KS C 5601 standard: Modern Jamo elements
	U+3164-318E: 	Mapping of KS C 5601 standard: Archaic Jamo elements
	U+318F: 	Currently unassigned

Author accepting responsibility is
:____Lee Collins______





More CJK Symbols U+3190-319F

Currently this block contains Unicodes for the four most recent Japanese eras:
	U+3190 # = 	Meiji era 	1867 - 1912
	U+3191 # = 	Taishou era	1912 - 1926
	U+3192 # = 	Showa era	1926 - 1989
	U+3193 # = 	Heisei era	1989 - 
 
Encoding structure: The Unicode block for more CJK symbols is divided into
the following ranges:
	U+3190-3193: 	Japanese era names
	U+3194-31FF: 	Currently unassigned

Author accepting responsibility is
:____Lee Collins______



CJK Parenthesized, Circled and Squared Abbreviations U+3200-33FF

CJK Parenthesized U+3200-325F 

!! NOTE: Standards mention is tentative

Standards: The CJK Parenthesized block provides mappings for all the parenthesized Hangul 
elements from Korean standard KS C 5601, as well as parenthesized ideographic characters from 
the JIS ?? standard, CNS ????, and several corporate registries.

Encoding structure: The Unicode block for CJK Parenthesized is divided into the following 
ranges:
	U+3200-320D: 	Parenthesized Hangul elements
	U+320E-321F: 	Parenthesized Hangul syllables
	U+3220-323A: 	Parenthesized ideographs
	U+323B-325F: 	Currently unassigned

CJK Encircled U+3260-32FF 

	U+3260-326D:	Circled Hangul elements
	U+326E-327B:	Circled Hangul syllables
	U+327C-327E:	Currently unassigned
	U+327F:	Korean Standard Symbol
	U+3280-32A8:	Circled ideographs
	U+32A9-32CF:	Currently unassigned
	U+32D0-32FE:	Circled Katakana
	U+32FF:	Japanese Industrial Standard symbol

Author accepting responsibility is
:____Lee Collins_________________


CJK Squared Katakana Words and Latin Abbreviation Symbols U+3300-33FF 

CJK squared Katakana words are Katakana-spelled words that fill a single character position when 
intermixed with ideographic Kanji characters. The set of squared Katakana words and Latin 
abbreviation symbols is derived from various company registries. 

Encoding structure: The Unicode block for CJK squared symbolic abbreviations is divided into 
the following ranges:

	U+3300-335A:	Squared Symbolic Katakana Words
	U+335B-337F:	Currently unassigned
	U+3380-33DD:	Squared Latin Abbreviation Symbols
	U+33DE-33FF:	Currently unassigned




Korean Hangul Syllables U+3400-

Korean Hangul Syllables

!! NOTE: Standards mention is tentative

Encoding structure: The Unicode block for Hangul syllables is divided into the following 
ranges:
	U+3190-3193: 
	U+3194-31FF: 	Currently unassigned

Extended Korean Hangul Syllables

!! NOTE: Standards mention is tentative

Encoding structure: The Unicode block for extended Hangul syllables is divided into the 
following ranges:
	U+3190-3193: 
	U+3194-31FF: 	Currently unassigned

Author accepting responsibility is
:_____Lee Collins________________


Extended Korean Hangul Syllables (cont.) 3E 

Author accepting responsibility is
:____Lee Collins_________________



Chinese/Japanese/Korean Ideographs U+4000 
From: Becker.OSBU_North@xerox.com
Subject: UniHan Levels
To: Unicode
Cc: davis.mark@applelink.apple.COM, liao@apple.com,
        Becker.OSBU_North@xerox.com
Message-Id: <"27-Sep-90 11:57:53 PDT".*.Joseph_D._Becker.OSBU_North@Xerox.com>


The proposed content of the UniHan Levels appears to have stabilized:


------------------------------------------------------------------
Level I      "Common"        (roughly 10,500 characters)
    Major Standards
        All of GB 2312-80 "G0"				( 6,763)
        All of GB ....... "G1"				( 6,951)
        All of JIS X0208-1983				( 6,353)
        All of KS C5601-1987				( 4,888)
        Taiwan CNS 11643-86 / Big Five  LEVEL 1		( 5,401)
        Taiwan CNS 11643-86 / Big Five "symbols"	(     9)
        Taiwan CCCII "Common" Level			( 4,808)
------------------------------------------------------------------
Level II     "Secondary"     (roughly  8,500 characters)
    Major Standards
        Rest of JIS draft supplementary set		( 5,843)
        Rest of Taiwan CNS 11643-86 / Big Five  LEVEL 2	( 7,652)
        Rest of ANSI/ NISO Z39.64-1989 = EACC		(13,481)
    Other Sources
        Rest of Xerox corporate collection
            Includes Telegraph Codes, Cantonese, etc.	( 9,776)
------------------------------------------------------------------


------------------------------------------------------------------
Level III    "Rare"          (...)
    Major Standards
        Rest of Taiwan CNS proposed extensions		( 6,339)
        Rest of Taiwan CCCII "Next Freq" Level		(17,032)
        Rest of GB 7589-87 "G2"				( 7,144)
        Rest of GB ....... "G3"				( 7,144)
        Rest of GB 7590-87 "G4"				( 6,956)
        Rest of GB ....... "G5"				( 6,956)
        Rest of other future national extensions	(     ?)
    Other Sources
        XinHua News Agency additions			(   694)
        GB Korean "Yidu" row				(    94)
        Rest of Japanese corporate standards		(     ?)
        Rest of Taiwan phone company name lists		(     ?)
        Rest of selected fonts, dictionaries, etc.	(     ?)
------------------------------------------------------------------


(The Xerox corporate collection is included in Level II because it represents
years of research into characters which are useful but which are not included
in national standards, e.g. characters specific to writing Cantonese.)


UniHan Version 1.0 will consist of Levels I & II.  Level I encompasses today's
existing standards that are in the 6,500 character range.  For pragmatic
reasons, the remainder of existing standards in the 13,000 character range are
placed in Level II.  The content listed above for Level III is merely
suggestive; requests for membership in Level III could accumulate for the rest
of the century.

This approach enables generic "Multilingual/International" systems to implement
UniHan Level I, which would become the one fixed standard Han character set for
covering all genuinely common CJK usage.  At the same time, Level II would be
available for producing full-functionality systems.  Level III would eventually
serve the needs of specialist applications.


Joe



Since each UniHan level is to be sorted in "radical/stroke order", that
ordering needs to be precisely defined.  The following are sketches toward
making that definition.


THE RADICALS

The overriding goal is to make the minimal augmentation to the traditional
KangXi system to be able to accommodate the PRC simplified characters.  There
is no attempt at all to make any innovative reform to the KangXi system.  In
particular, all traditional characters will receive a totally traditional
treatment, so the only real problem is to define the treatment of simplified
characters.  Thus, the UniHan radicals will consist of the 214 traditional
KangXi radicals plus some number of PRC simplified radicals.

The authorities taken for the PRC simplified radicals are the encoding standard
GB2312-80 plus two authoritative dictionaries Xin CiHai (XCH) and XianDai HanYu
CiDian (XDHYCD).  Based on these, the proposed list of 22 PRC simplified
radicals is as follows:

----------------------------------------
	XDHYCD	Trad	Meaning
	------	----	-------
	 27	149	speech
	 59	184	food
	 63	169	door
	 64	 90	bed
	 76	187	horse
	 77	120	silk
	 83	178	leather (wei)
	 91	159	vehicle
	102	154	cowry shell
	103	147	see
	116	182	wind
	137	212	dragon
	146	167	gold, metal
	152	196	bird
	171	181	page
	187	210	alike
	195	199	wheat
	203	197	salt
	210	213	tortoise
	219	211	tooth
	221	205	frog
	223	195	fish
----------------------------------------


Included are the 2 simplified forms of traditional radicals that are in XCH &
XDHYCD but not in GB2312-80:

	XDHYCD	Trad	Meaning
	------	----	-------
	187	210	alike
	210	213	tortoise

Excluded are all newly added PRC radicals that are not simplified versions of
traditional radicals, in particular the 2 that are in GB2312-80:

	GB2312	Sound	Meaning
	-----	-----	-------
	111	ye4	industry (simplified form)
	169	qi2	its (mo-ming-qi-miao de!)

Excluded are all revisions, reassignments, recombinations, and separated
variants of the 214 traditional KangXi radicals, for example:

	XDHYCD	Trad	Meaning
	------	----	-------
	46	 64	hand (ti shou pang)
	65	 85	water (san dian shui)





DETERMINING THE RADICAL OF A CHARACTER:

    (1) If the character itself is a (Uni)Radical, it is assigned under itself
        Example:
            * The traditional character for "dragon" is assigned to
              traditional Radical 212
            * The Japanese simplified character for "dragon" is also assigned
              to traditional Radical 212 (since the Japanese themselves use
              this approach and not additional simplified radicals)
            * The Chinese simplified character for "dragon" is assigned to the
              simplified radical for "dragon" (and not to any graphical
              sub-fragment of it)


    (2) If the character has a traditional KangXi radical (a la Dai KanWa
        JiTen, CCCII, etc.), use that
        This includes special cases:

            > All Japanese- and Korean-unique characters are assigned into the
              KangXi system as is done in their native dictionaries and in JIS
              standards

            > Traditional characters having traditional radicals that
              GB2312-80 & XCH & XDHYCD treat in innovative ways are to be
              treated in the traditional way (e.g. characters having
              san-dian-shui are mixed in at random with the other Radical
              85'ers as is traditional)

            > Simplified characters (other than radicals themselves) which
              still contain the same radical as their traditional form are
              assigned to the traditional value of that radical
              Example: the simplified form of hu2 "(tea)pot" contains the same
              radical as the unsimplified form (traditional 33), but XCH &
              XDHYCD (not GB2312-80!) map Radical 33 to Radical 32; in UniHan
              the wholly traditional Radical 33 would be used

            > Simplified characters (other than radicals themselves) which no
              longer contain their traditional radical at all are assigned to
              the "new" radical given by GB2312-80 & XCH & XDHYCD that is
              actually a fragment of the simplified glyph, and not to the same
              radical as the unsimplified version of the character
              Example: the 3-stroke simplified form of wei4 "to protect" (as
              in weisheng or weibing) is assigned to traditional Radical 26,
              and not to traditional Radical 144 which is the radical of the
              unsimplified version of the same character
              Discussion: this convention seems to make more practical sense
              than assigning a character to a radical that is visually
              unrelated to its glyph


    (3) If the character has one of the 22 PRC simplified versions of
        traditional radicals, use that


    (4) Otherwise, in the rare cases not covered above, improvise
        Example: the simplified form of ye4 "industry" does not fall into
        categories (1)-(3); we suggest assigning it to UniRad 1, for instance
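
Rules (1)-(4) amount to a fixed order of precedence.  The Python sketch below
shows only that precedence.  It assumes that someone has already decided, for
each character, which radical (if any) each rule would give; the argument
names and the function are hypothetical and are not part of the proposal.

```python
from typing import Optional

FALLBACK_RADICAL = 1  # rule (4) suggests UniRad 1 "for instance"


def assign_radical(self_radical: Optional[int],
                   kangxi_radical: Optional[int],
                   prc_radical: Optional[int]) -> int:
    """Pick a radical by trying rules (1), (2), (3) in order, then rule (4).

    Each argument is the radical that rule would give, or None when the rule
    does not apply to the character.
    """
    for candidate in (self_radical, kangxi_radical, prc_radical):
        if candidate is not None:
            return candidate
    return FALLBACK_RADICAL  # rule (4): improvise


# Traditional "dragon" is itself Radical 212, so rule (1) applies.
print(assign_radical(212, 212, None))    # 212
# Simplified wei4 "to protect": rule (2) gives traditional Radical 26.
print(assign_radical(None, 26, None))    # 26
# Simplified ye4 "industry": no rule applies, so the fallback is used.
print(assign_radical(None, None, None))  # 1
```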





THE ORDERING OF RADICALS

We have considered four possible schemes for ordering characters having PRC
simplified radicals relative to characters having traditional KangXi radicals:

    (1) Intersperse at the character level: characters having PRC radicals
        immediately follow their unsimplified counterparts

    (2A) Intersperse at the group level such that the group of all characters
         having a given PRC radical immediately follows the group of all
         characters having that radical's unsimplified counterpart (e.g. the
         2-stroke PRC simplified form of Radical 149 "speech" would
         immediately follow the 7-stroke Radical 149)

    (2B) Intersperse at the group level based on the stroke count of the
         radical (e.g. the 2-stroke PRC simplified form of Radical 149
         "speech" would be near the front)

    (3) Segregate all PRC radicals to the end, so that the first 214 radicals
        are the traditional KangXi ones and then the PRC simplified radicals
        follow as numbers 215 through 236
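
To make the difference between the group-level schemes concrete, here is a
Python sketch of sort keys for schemes (2A) and (3), using the list of 22 PRC
simplified radicals given earlier.  Two things are assumptions: the function
names, and the order in which scheme (3) hands out numbers 215-236 (taken here
to be the order of that list).  Scheme (2B) is omitted because it needs stroke
counts for every radical, which are not listed in this message.

```python
# (XDHYCD number, traditional KangXi number) for the 22 PRC simplified radicals
PRC_SIMPLIFIED = [
    (27, 149), (59, 184), (63, 169), (64, 90), (76, 187), (77, 120),
    (83, 178), (91, 159), (102, 154), (103, 147), (116, 182), (137, 212),
    (146, 167), (152, 196), (171, 181), (187, 210), (195, 199), (203, 197),
    (210, 213), (219, 211), (221, 205), (223, 195),
]
TRAD_OF_SIMPLIFIED = [trad for _, trad in PRC_SIMPLIFIED]


def key_2a(trad: int, simplified: bool):
    """Scheme (2A): a simplified group sorts right after its traditional group."""
    return (trad, 1 if simplified else 0)


def key_3(trad: int, simplified: bool) -> int:
    """Scheme (3): traditional radicals keep 1-214; simplified take 215-236.

    Assumption: numbers are handed out in the order of PRC_SIMPLIFIED.
    """
    if not simplified:
        return trad
    return 215 + TRAD_OF_SIMPLIFIED.index(trad)


# Three radical groups: traditional 149, simplified 149, traditional 150.
groups = [(150, False), (149, True), (149, False)]
print(sorted(groups, key=lambda g: key_2a(*g)))
# [(149, False), (149, True), (150, False)]
print(sorted(groups, key=lambda g: key_3(*g)))
# [(149, False), (150, False), (149, True)]
```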


Although no ordering is free from problems, we picked one of the above as the
most appropriate for UniHan ... see if you can guess which ...


Joe


Draft already in Manual
Editor is working with Author to revise

Author accepting responsibility is
:____Lee Collins_________________



Private Use Area (Codes defined by Private Agreements) U+F000-FFFE 

Author accepting responsibility is
:_____________________


Compatibility Zone for IBM CodePages

Author accepting responsibility is
:_____________________

Unicode Draft: Character Blocks and Block Introductions	9/27/90