| Version | Unicode 3.1.0 |
| Authors | Mark Davis, Michael Everson, Asmus Freytag, John H. Jenkins and other members of the editorial committee |
| Date | 2001-05-16 |
| This Version | http://www.unicode.org/unicode/reports/tr27/tr27-4.html |
| Previous Version | http://www.unicode.org/unicode/reports/tr27/tr27-3.html |
| Latest Version | http://www.unicode.org/unicode/reports/tr27 |
| Tracking Number | 4 |
This document defines Version 3.1 of the Unicode Standard. It overrides certain features of Unicode 3.0.1, and adds a large number of coded characters.
This document has been reviewed by Unicode members and other interested parties, and has been approved by the Unicode Technical Committee as a Unicode Standard Annex. It is a stable document and may be used as reference material or cited as a normative reference from another document.
A Unicode Standard Annex (UAX) forms an integral part of the Unicode Standard, carrying the same version number, but is published as a separate document. Note that conformance to a version of the Unicode Standard includes conformance to its Unicode Standard Annexes.
A list of current Unicode Technical Reports is found on http://www.unicode.org/unicode/reports/. For more information about versions of the Unicode Standard, see http://www.unicode.org/unicode/standard/versions/.
The References provide related information that is useful in understanding this document. Please mail corrigenda and other comments to the author(s).
Unicode 3.1 is a minor version of the Unicode Standard. It overrides certain features of Unicode 3.0.1, and adds a large number of coded characters.
The Unicode Standard, Version 3.1 is defined by the following list. The version numbering and the role of each component are explained in Versions of The Unicode Standard. The symbols in the change status column are explained in the key below. A summary of modifications in the Unicode Character Database for this version can be found in UnicodeCharacterDatabase-3.1.html, together with a list of which data files contain normative vs. informative data.
| N | New in this release |
| D | Data change (possibly also format/text change) |
| F | Data format change (possibly also text change) |
| T | Text annotation change |
| - | Unchanged |
The primary feature of Unicode 3.1 is the addition of 44,946 new encoded characters. These characters cover several historic scripts, several sets of symbols, and a very large collection of additional CJK ideographs.
For the first time, characters are encoded beyond the original 16-bit codespace or Basic Multilingual Plane (BMP or Plane 0). These new characters, encoded at code positions of U+10000 or higher, are synchronized with the forthcoming standard ISO/IEC 10646-2. For further information, see Article IX, Relation to 10646. Unicode 3.1 and 10646-2 define three new supplementary planes:
The Supplementary Multilingual Plane, or Plane 1, contains several historic scripts, and several sets of symbols: Old Italic, Gothic, Deseret, Byzantine Musical Symbols, (Western) Musical Symbols, and Mathematical Alphanumeric Symbols. Together these comprise 1594 newly encoded characters.
The Supplementary Ideographic Plane, or Plane 2, contains a very large collection of additional unified Han ideographs known as Vertical Extension B, comprising 42,711 characters, as well as 542 additional CJK Compatibility ideographs.
The Supplementary Special-purpose Plane, or Plane 14, contains a set of tag characters, 97 in all.
Complete introductions to the newly encoded scripts, symbols, and new additions to Han ideographs can be found in Article V, Block Descriptions, below.
In addition, Unicode 3.1 adds two mathematical symbols in the BMP:
U+03F4 GREEK CAPITAL THETA SYMBOL
U+03F5 GREEK LUNATE EPSILON SYMBOL
These two characters are not part of ISO/IEC 10646-2, but are among the additions in the forthcoming Amendment 1 to ISO/IEC 10646-1:2000. They are included in Unicode 3.1 so that decompositions for the Mathematical Alphanumeric Symbols can be internally consistent.
Counting the additions to the three supplementary planes and the two characters on the BMP, Unicode 3.1 adds 44,946 new encoded characters. Together with the 49,194 already existing characters in Unicode 3.0, that comes to a grand total of 94,140 encoded characters in Unicode 3.1.
Of those 94,140 characters, 70,207 are unified Han ideographs, and an additional 832 are CJK Compatibility ideographs -- slightly more than 75% of the encoded characters in the standard.
In addition, 32 more code points have been allocated as noncharacters. For more information, see Article III, Conformance.
See Article VI, Code Charts, for links to online charts of the new characters for Unicode 3.1.
Unicode 3.1 also features amended contributory data files, to bring the data files up to date against the much expanded repertoire of characters. A summary of the new data files and changes to old data files can be found in Article VIII, Unicode Character Database Changes. A complete specification of the contributory data files constituting the Unicode Standard, Version 3.1 can be found in Enumerated Versions.
All errata and corrigenda to Unicode 3.0 and Unicode 3.0.1 are included in this specification. Major corrigenda and other changes having a bearing on conformance to the standard are listed in Article III, Conformance. Other minor errata are listed in Article VII, Errata.
Most notable among the corrigenda to the standard is a tightening of the definition of UTF-8, to eliminate a possible security issue with non-shortest-form UTF-8.
The sections of this document are referred to as "articles" to prevent confusion with references to sections of The Unicode Standard, Version 3.0. In addition, the articles in this document are numbered with Roman numerals, to highlight the distinction. The word "section" always refers to sections of The Unicode Standard, Version 3.0. Page numbers also refer to The Unicode Standard, Version 3.0.
New or replacement text for the standard is indicated with underlined text, when this new text is a corrigendum of an existing section of the standard.
Deleted text from the standard is indicated with struck-through text.
In instances where entire new sections or subsections are to be added to the standard, as for the block descriptions for newly encoded scripts or symbol sets, new section numbers are provided that interleave reasonably with the existing sections of the published Unicode 3.0 book. And for these added sections, the text is not underlined, since the entire sections are new.
In this document, unambiguous dates of the current common era, such as 1999, are unlabeled. In cases of ambiguity, CE is used. Dates before the common era are labeled with BCE.
Some of the characters in Article V, Block Descriptions, are Greek and may not be displayed by all browsers. For assistance, see Display Problems.
Section 0.2 Notational Conventions, page xxviii: change the description of the U+ notation to read:
In running text, an individual Unicode code point can be expressed as U+n, where n is from four to six hexadecimal digits, using the digits 0-9 and A-F (for 10 through 15, respectively). There should be no leading zeros, unless the codepoint would have fewer than four hexadecimal digits; for example, U+0001, U+0012, U+0123, U+1234, U+12345, U+102345.
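The padding rule above can be sketched in a few lines of Python. The function name `format_code_point` is a hypothetical helper chosen for illustration, not something defined by the standard.

```python
def format_code_point(cp: int) -> str:
    """Render a code point in U+ notation using four to six hex digits."""
    if not 0 <= cp <= 0x10FFFF:
        raise ValueError(f"not a Unicode code point: {cp:#x}")
    # Pad with zeros only up to four digits; larger values keep their natural width.
    return f"U+{cp:04X}"


examples = [0x1, 0x12, 0x123, 0x1234, 0x12345, 0x102345]
print([format_code_point(cp) for cp in examples])
# ['U+0001', 'U+0012', 'U+0123', 'U+1234', 'U+12345', 'U+102345']
```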
Section 0.2 Notational Conventions, page xxviii: replace the paragraph starting "A sequence of characters" with the following text:
A sequence of two or more code points may be represented by a comma-delimited list, set off by angle brackets. For this purpose angle brackets consist of U+003C LESS-THAN SIGN and U+003E GREATER-THAN SIGN. Spaces are optional after the comma, and U+ notation for the code point is also optional. A sequence identified with this notation is called a Unicode Sequence Identifier (USI).
When the usage is clear from the context, a sequence of characters may also be represented with generic short names, for example as in "<a, grave>", or the angle brackets may be omitted.
In contrast to sequences of code points, a sequence of one or more code units may be represented by a list set off by angle brackets, but without comma delimitation or U+ notation. For example, the notation "<nn nn nn nn>" represents a sequence of bytes, as for the UTF-8 encoding form of a Unicode character. The notation "<nnnn nnnn>" represents a sequence of 16-bit code units, as for the UTF-16 encoding form of a Unicode character. In the text, the angle brackets are occasionally omitted from this notation when the usage is clear in context.
In other environments, such as programming languages or mark-up, alternative notation for sequences of code points or code units may be used.
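As an illustration of the two notations described above, the following Python sketch formats a sequence of code points as a Unicode Sequence Identifier and a sequence of code units in the space-separated form. Both function names are hypothetical, and the choice to always emit the optional space and U+ prefix is an assumption of this sketch.

```python
def format_usi(code_points):
    """Format two or more code points as a Unicode Sequence Identifier."""
    if len(code_points) < 2:
        raise ValueError("a USI names a sequence of two or more code points")
    return "<" + ", ".join(f"U+{cp:04X}" for cp in code_points) + ">"


def format_code_units(units, width):
    """Format code units as space-separated hex, `width` hex digits per unit."""
    return "<" + " ".join(f"{u:0{width}X}" for u in units) + ">"


print(format_usi([0x0061, 0x0300]))            # <U+0061, U+0300>
print(format_code_units(b"\xC3\xA0", 2))       # <C3 A0>
print(format_code_units([0xD801, 0xDC00], 4))  # <D801 DC00>
```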
On page xxvii, in the section, "The Unicode Character Database and Technical Reports," the paragraph beginning, "The following Unicode Technical Reports..." is updated to read as follows:
The following Unicode Standard Annexes are formally part of this standard:
- UAX #9: The Bidirectional Algorithm, Version 3.1.0
- UAX #11: East Asian Width, Version 3.1.0 (formerly UTR #11, Version 5.0)
- UAX #13: Unicode Newline Guidelines, Version 3.1.0 (formerly UTR #13, Version 5.0)
- UAX #14: Line Breaking Properties, Version 3.1.0 (formerly UTR #14, Version 6.0)
- UAX #15: Unicode Normalization Forms, Version 3.1.0 (formerly UTR #15, Version 18.0)
- UAX #19: UTF-32, Version 3.1.0
There are three major changes to the conformance clauses of the Unicode Standard for Version 3.1. The first of these is the addition of new noncharacters and a clarification regarding noncharacter status. The second is a major corrigendum to the definition of UTF-8 to address security issues. The third change is that UTF-32 is now part of the standard. There are additional normative changes in Unicode 3.1 that have implications for conformance. These are described in Article VIII, Unicode Character Database Changes, and in Section 13.2 Layout Controls of Article V, Block Descriptions.
In Section 3.1, Conformance Requirements on page 37, add the following paragraph immediately after the first paragraph and before the subsection, "Byte Ordering":
Each version of the Unicode Standard, once published, is absolutely stable and will never change. Implementations or specifications that refer to a specific version of the Unicode Standard can rely upon this stability. If future versions of these implementations or specifications upgrade to a future version of the Unicode Standard, then some changes may be necessary.
To clarify the interpretation of Unicode code units in the context of the transformation formats, conformance clause C1 has been reworded:
C1 A process shall interpret Unicode code units in accordance with the Unicode Transformation Format used.
- The Unicode Standard defines code points (scalar values) that can be encoded in any of three transformation formats (encoding forms): UTF-8, UTF-16, or UTF-32.
- For information on the use of wchar_t or other programming language types to represent Unicode code units, see Section 5.2, ANSI/ISO C wchar_t.
There are 34 specific code points in Unicode 3.0 that are characterized as noncharacters. Unicode 3.1 adds an additional 32 noncharacters. To clarify the status of all 66, a definition (page 41) is added, and conformance rules C5 and C10 (pages 38, 39) are amended as follows:
D7b Noncharacter: a code point that is permanently reserved for internal use, and that should never be interchanged. In Unicode 3.1, these consist of the values U+nFFFE and U+nFFFF (where n is from 0 to 10₁₆) and the values U+FDD0..U+FDEF.
- For more information, see the discussions under "Special Noncharacter Values" in Section 2.7, Special Character and Noncharacter Values, and under "Noncharacters" in Section 13.6, Specials.
- These code points are permanently reserved as noncharacters. In the future, it is possible that additional code points may be specified to represent noncharacters.
C5 A process shall not interpret a noncharacter code point as an abstract character.
- The code points may be used internally, such as for sentinel values or delimiters, but should not be exchanged publicly.
C10 A process shall make no change in a valid coded character representation other than the possible replacement of character sequences by their canonical-equivalent sequences or the deletion of noncharacter code points, if that process purports not to modify the interpretation of that coded character sequence.
- If a noncharacter which does not have a specific internal use is unexpectedly encountered in processing, an implementation may signal an error or delete or ignore the noncharacter. If these options are not taken, the noncharacter should be treated as an unassigned code point. For example, an API that returned a character property value for a noncharacter would return the same value as the default value for an unassigned code point.
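The definition in D7b and the deletion option permitted by C10 can be sketched as follows. This is a minimal illustration in Python; `is_noncharacter` and `strip_noncharacters` are hypothetical names, and deleting (rather than signalling an error) is only one of the behaviors the note above allows.

```python
def is_noncharacter(cp: int) -> bool:
    """True for the 66 noncharacter code points described in D7b."""
    if 0xFDD0 <= cp <= 0xFDEF:
        return True
    # U+nFFFE and U+nFFFF for every plane n from 0 to 0x10.
    return 0 <= cp <= 0x10FFFF and (cp & 0xFFFE) == 0xFFFE


def strip_noncharacters(text: str) -> str:
    """Delete noncharacters, one of the options C10 permits."""
    return "".join(ch for ch in text if not is_noncharacter(ord(ch)))


count = sum(1 for cp in range(0x110000) if is_noncharacter(cp))
print(count)  # 66 (34 of the form U+nFFFE/U+nFFFF plus 32 in U+FDD0..U+FDEF)
print(strip_noncharacters("a\uFFFFb\uFDD0"))  # ab
```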
The current conformance clause C12 in The Unicode Standard, Version 3.0 forbids the generation of "non-shortest form" UTF-8, and forbids the interpretation of illegal sequences, but not the interpretation of "non-shortest form". Where software does interpret the non-shortest forms, security issues can arise. For example:
To address this issue, the Unicode Technical Committee has modified the definition of UTF-8 to forbid conformant implementations from interpreting non-shortest forms for BMP characters, and clarified some of the conformance clauses.
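The kind of problem at stake can be illustrated with a small Python sketch. The helper `lenient_decode_2byte` is hypothetical and deliberately omits the shortest-form check, standing in for a permissive pre-3.1 decoder; `contains_slash` stands in for a security filter that inspects bytes before decoding. Python's built-in decoder is used only as an example of a strict implementation.

```python
def lenient_decode_2byte(b0: int, b1: int) -> int:
    """Decode a two-byte pattern WITHOUT rejecting non-shortest forms."""
    if b0 & 0xE0 != 0xC0 or b1 & 0xC0 != 0x80:
        raise ValueError("not a two-byte UTF-8 pattern")
    return ((b0 & 0x1F) << 6) | (b1 & 0x3F)


def contains_slash(data: bytes) -> bool:
    """A byte-level filter that only looks for the one-byte form of '/'."""
    return b"/" in data


def strict_accepts(data: bytes) -> bool:
    """True if Python's strict UTF-8 decoder accepts the bytes."""
    try:
        data.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


payload = bytes([0xC0, 0xAF])                   # non-shortest form of U+002F
print(contains_slash(payload))                  # False: the filter sees no '/'
print(hex(lenient_decode_2byte(*payload)))      # 0x2f: the lenient decoder yields '/'
print(strict_accepts(payload))                  # False: a strict decoder rejects it
```

The filter and the lenient decoder disagree about whether the data contains a slash, which is the gap the tightened definition closes.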
These modifications make use of updated notation: see the Glossary for any unfamiliar terms.
Change C12 to the following:
C12 (a) When a process generates data in a Unicode Transformation Format, it shall not emit ill-formed code unit sequences.
(b) When a process interprets data in a Unicode Transformation Format, it shall treat illegal code unit sequences as an error condition.
(c) A conformant process shall not interpret illegal UTF code unit sequences as characters.
(d) Irregular UTF code unit sequences shall not be used for encoding any other information.
Add the following notes after C12:
For example, UTF-8 allows nonshortest code value sequences to be interpreted: a UTF-8 conformant process may map the code value sequence C0 80 (11000000₂ 10000000₂) to the Unicode value U+0000, even though a UTF-8 conformant process shall never generate that code value sequence -- it shall generate the sequence 00 (00000000₂) instead.
Modify D36 as follows, and add a note:
D36 (a) UTF-8 is the Unicode Transformation Format that serializes a Unicode code point as a sequence of one to four bytes, as specified in Table 3.1, UTF-8 Bit Distribution.
(b) An illegal UTF-8 code unit sequence is any byte sequence that does not match the patterns listed in Table 3.1B, Legal UTF-8 Byte Sequences.
(c) An irregular UTF-8 code unit sequence is a six-byte sequence where the first three bytes correspond to a high surrogate, and the next three bytes correspond to a low surrogate. As a consequence of C12, these irregular UTF-8 sequences shall not be generated by a conformant process.
Retain the paragraph and table immediately below D36, but replace the last sentence in the paragraph.
Table 3.1 specifies the bit distribution from a Unicode character (or surrogate pair) into the one- to four-byte values of the corresponding UTF-8 sequence. Note that the four-byte form for surrogate pairs involves an addition of 10000₁₆, to account for the starting offset to the encoded values referenced by surrogates. For a discussion of the difference in the formulation of UTF-8 in ISO/IEC 10646, see Section C.3, UCS Transformation Formats.
The definition of UTF-8 in Annex D of ISO/IEC 10646-1:2000 also allows for the use of five- and six-byte sequences to encode characters that are outside the range of the Unicode character set; those five- and six-byte sequences are illegal for the use of UTF-8 as a transformation of Unicode characters.
Table 3.1. UTF-8 Bit Distribution
| Scalar Value | UTF-16 | 1st Byte | 2nd Byte | 3rd Byte | 4th Byte |
| 00000000 0xxxxxxx | 00000000 0xxxxxxx | 0xxxxxxx | | | |
| 00000yyy yyxxxxxx | 00000yyy yyxxxxxx | 110yyyyy | 10xxxxxx | | |
| zzzzyyyy yyxxxxxx | zzzzyyyy yyxxxxxx | 1110zzzz | 10yyyyyy | 10xxxxxx | |
| 000uuuuu zzzzyyyy yyxxxxxx | 110110ww wwzzzzyy 110111yy yyxxxxxx | 11110uuu | 10uuzzzz | 10yyyyyy | 10xxxxxx |
- Where uuuuu = wwww + 1 (to account for addition of 10000₁₆ as in Section 3.7, Surrogates).
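The bit distribution in Table 3.1 can be written out directly as shifts and masks. The Python sketch below is one possible transcription, not a reference implementation: `encode_utf8` and `surrogate_bits` are hypothetical names, and the encoder works from the scalar value rather than from a surrogate pair.

```python
def encode_utf8(scalar: int) -> bytes:
    """Distribute the bits of a scalar value as in Table 3.1."""
    if not 0 <= scalar <= 0x10FFFF:
        raise ValueError("outside the Unicode codespace")
    if scalar <= 0x7F:                      # 0xxxxxxx
        return bytes([scalar])
    if scalar <= 0x7FF:                     # 110yyyyy 10xxxxxx
        return bytes([0xC0 | (scalar >> 6), 0x80 | (scalar & 0x3F)])
    if scalar <= 0xFFFF:                    # 1110zzzz 10yyyyyy 10xxxxxx
        return bytes([0xE0 | (scalar >> 12),
                      0x80 | ((scalar >> 6) & 0x3F),
                      0x80 | (scalar & 0x3F)])
    return bytes([0xF0 | (scalar >> 18),    # 11110uuu 10uuzzzz 10yyyyyy 10xxxxxx
                  0x80 | ((scalar >> 12) & 0x3F),
                  0x80 | ((scalar >> 6) & 0x3F),
                  0x80 | (scalar & 0x3F)])


def surrogate_bits(scalar: int):
    """Return (uuuuu, wwww) for a supplementary scalar value."""
    high = 0xD800 | ((scalar - 0x10000) >> 10)   # 110110ww wwzzzzyy
    return scalar >> 16, (high >> 6) & 0xF


print(encode_utf8(0x10400).hex(" "))  # f0 90 90 80
print(surrogate_bits(0x10400))        # (1, 0), so uuuuu = wwww + 1
```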
Delete the two text paragraphs after Table 3.1. (The relevant portions have been elevated into definitions or conformance clauses.)
When converting a Unicode scalar value to UTF-8, the shortest form that can represent those values shall be used. This practice preserves uniqueness of encoding. For example, the Unicode binary value <0000000000000001> is encoded as <00000001>, not as <11000000 10000001>. The latter is an example of an irregular UTF-8 byte sequence. Irregular UTF-8 sequences shall not be used for encoding any other information.
When converting from UTF-8 to a Unicode scalar value, implementations do not need to check that the shortest encoding is being used. This simplifies the conversion algorithm.
Replace them by the following table and text:
Table 3.1B. Legal UTF-8 Byte Sequences
| Code Points | 1st Byte | 2nd Byte | 3rd Byte | 4th Byte |
| U+0000..U+007F | 00..7F | | | |
| U+0080..U+07FF | C2..DF | 80..BF | | |
| U+0800..U+0FFF | E0 | A0..BF | 80..BF | |
| U+1000..U+FFFF | E1..EF | 80..BF | 80..BF | |
| U+10000..U+3FFFF | F0 | 90..BF | 80..BF | 80..BF |
| U+40000..U+FFFFF | F1..F3 | 80..BF | 80..BF | 80..BF |
| U+100000..U+10FFFF | F4 | 80..8F | 80..BF | 80..BF |
Table 3.1B lists all of the byte sequences that are legal in UTF-8. A range of byte values such as A0..BF indicates that any byte from A0 to BF (inclusive) is legal in that position. Any byte value outside of the ranges listed is illegal. For example, the byte sequence <C0 AF> is illegal since C0 is not legal in the 1st Byte column. The byte sequence <E0 9F 80> is illegal since in the row where E0 is legal as a first byte, 9F is not legal as a second byte. The byte sequence <F4 80 83 92> is legal, since every byte in that sequence matches a byte range in a row of the table (the last row).
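A table-driven check is one straightforward way to apply Table 3.1B. The following Python sketch transcribes the rows as byte ranges and tests the three examples from the paragraph above; `LEGAL_ROWS` and `is_legal_utf8` are hypothetical names introduced for illustration.

```python
# Rows of Table 3.1B: one (low, high) range per byte position.
LEGAL_ROWS = [
    ((0x00, 0x7F),),
    ((0xC2, 0xDF), (0x80, 0xBF)),
    ((0xE0, 0xE0), (0xA0, 0xBF), (0x80, 0xBF)),
    ((0xE1, 0xEF), (0x80, 0xBF), (0x80, 0xBF)),
    ((0xF0, 0xF0), (0x90, 0xBF), (0x80, 0xBF), (0x80, 0xBF)),
    ((0xF1, 0xF3), (0x80, 0xBF), (0x80, 0xBF), (0x80, 0xBF)),
    ((0xF4, 0xF4), (0x80, 0x8F), (0x80, 0xBF), (0x80, 0xBF)),
]


def is_legal_utf8(data: bytes) -> bool:
    """True if every byte of `data` falls inside some row of Table 3.1B."""
    i = 0
    while i < len(data):
        for row in LEGAL_ROWS:
            chunk = data[i:i + len(row)]
            if len(chunk) == len(row) and all(
                lo <= b <= hi for b, (lo, hi) in zip(chunk, row)
            ):
                i += len(row)   # first-byte ranges are disjoint, so one row at most
                break
        else:
            return False        # no row matched at position i
    return True


print(is_legal_utf8(bytes([0xC0, 0xAF])))              # False
print(is_legal_utf8(bytes([0xE0, 0x9F, 0x80])))        # False
print(is_legal_utf8(bytes([0xF4, 0x80, 0x83, 0x92])))  # True
```

Because the sketch follows the table exactly as given here, three-byte sequences beginning with ED A0..BF pass the check; decoders that reject encoded surrogates will give different results for those inputs.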
Add to Appendix C: Relationship to ISO/IEC 10646, Section C.3: UCS Transformation Formats, at the end of the subsection UTF-8:
The definition of UTF-8 in Annex D of ISO/IEC 10646-1:2000 also allows for the use of five- and six-byte sequences to encode characters that are outside the range of the Unicode character set; those five- and six-byte sequences are illegal for the use of UTF-8 as a transformation of Unicode characters. ISO/IEC 10646 does not allow mapping of unpaired surrogates, nor U+FFFE and U+FFFF (but it does allow other noncharacters).
Unicode Technical Report #19, UTF-32, has been elevated to the status of a Unicode Standard Annex, making UTF-32 officially a part of the Unicode Standard. UAX #19 adds specific definition clauses to Section 3.8, Transformations, of The Unicode Standard, Version 3.0. See UAX #19 for the exact definitions of UTF-32 as well as a discussion of the relation of UTF-32 to ISO/IEC 10646 and UCS-4.
With the addition of UTF-32, the Unicode Standard now has three sanctioned encoding forms: UTF-8, UTF-16, and UTF-32. These are the 8-bit, 16-bit, and 32-bit forms, respectively, for representing the Unicode scalar values in particular implementations of the standard.
Considerations of byte-order serialization lead to a further subdivision of the encoding forms into 5 sanctioned encoding schemes for the Unicode Standard: UTF-8, UTF-16BE, UTF-16LE, UTF-32BE, and UTF-32LE.
Because UTF-32 is a fixed-width, 32-bit encoding form, the numerical value of a Unicode character in UTF-32 is always precisely identical to the Unicode scalar value.
The encoding scheme UTF-32BE (UTF-32 serialized as bytes in most significant byte first order) is structurally the same as UCS-4, as defined in ISO/IEC 10646-1:2000.
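To make the three encoding forms concrete, the Python sketch below encodes a single supplementary character, U+10400 DESERET CAPITAL LETTER LONG I, and counts code units in each form. It relies on Python's built-in codecs; the function name `code_unit_counts` is a hypothetical helper.

```python
def code_unit_counts(cp: int) -> dict:
    """Number of code units needed for one code point in each encoding form."""
    ch = chr(cp)
    return {
        "UTF-8": len(ch.encode("utf-8")),            # 8-bit code units
        "UTF-16": len(ch.encode("utf-16-be")) // 2,  # 16-bit code units
        "UTF-32": len(ch.encode("utf-32-be")) // 4,  # 32-bit code units
    }


cp = 0x10400
print(code_unit_counts(cp))  # {'UTF-8': 4, 'UTF-16': 2, 'UTF-32': 1}

utf32be = chr(cp).encode("utf-32-be")
utf32le = chr(cp).encode("utf-32-le")
print(utf32be.hex(" "))                      # 00 01 04 00
print(utf32le.hex(" "))                      # 00 04 01 00
print(int.from_bytes(utf32be, "big") == cp)  # True: the code unit is the scalar value
```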
See also Unicode Technical Report #17, Character Encoding Model, for a discussion of the general framework for understanding the Unicode character encoding and its relationship to the Unicode Transformation Formats.
Add the following entry to the end of the special character properties listing, on page 50:
1D173 MUSICAL SYMBOL BEGIN BEAM
1D174 MUSICAL SYMBOL END BEAM
1D175 MUSICAL SYMBOL BEGIN TIE
1D176 MUSICAL SYMBOL END TIE
1D177 MUSICAL SYMBOL BEGIN SLUR
1D178 MUSICAL SYMBOL END SLUR
1D179 MUSICAL SYMBOL BEGIN PHRASE
1D17A MUSICAL SYMBOL END PHRASE
All of the General Category values plus the case mappings in UnicodeData.txt and SpecialCasing.txt are now normative. The case mapping row from Table 4-2, Informative Character Properties, page 74 is moved to Table 4-1, Normative Character Properties. The word "informative" is struck from Table 4-5, General Category, page 88. The header of Section 4.5, General Category--Normative in Part, page 87 is changed to Section 4.5, General Category--Normative. The other textual changes in Chapter 4 resulting from this change in status are not detailed here.
On page 73, make the following changes:
Normative Properties. Normative means that implementations that claim conformance to the Unicode Standard (at a particular version) and that make use of a particular property must follow the specifications of the standard for that property to be conformant.
Thus, for example, the Bidirectional Character Type is required for conformance whenever displaying bidirectional text, such as Arabic or Hebrew. The term normative when applied to a character property does not mean that the value of the property will never change. Corrections and extensions to the standard in the future may require minor changes to normative values, even though the Unicode Technical Committee strives to minimize such changes.
Informative Properties. If a character property is only informative, a conformant implementation is free to use or change such values as it may require, while still remaining conformant to the standard. However, their use is strongly recommended. Particular implementations may choose to override the properties that are not normative. In that case, the implementer has the option of establishing a protocol to convey that information.
Normative References. Other specifications may choose to make normative references to Unicode character properties irrespective of their status as normative or informative in the Unicode Standard.
On page 102, add the following at the bottom of the page:
Identifier Stability. Unicode General Category values are kept as stable as possible, but they may change in ways that affect identifiers in new versions. (See Unicode Policies for more information.) When another standard or product upgrades to a new version of the Unicode Standard, it may have to handle characters that were formerly part of ID_Start or ID_Continue, but are no longer.
This situation can be handled by having two explicit backwards compatibility lists: ID_Start_Supplement and ID_Continue_Supplement. The implementation's specification of identifiers would include the union of the respective Unicode properties and those supplement lists.
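The union described above might look like the following in Python. This is a sketch under stated assumptions: the set of General Category values used for identifier start (Lu, Ll, Lt, Lm, Lo, Nl) is assumed from the general identifier guidelines, the supplement contents are hypothetical, and `unicodedata` reflects whatever Unicode version the Python runtime ships with.

```python
import unicodedata

# Assumed General Category values for the start of an identifier.
ID_START_CATEGORIES = frozenset({"Lu", "Ll", "Lt", "Lm", "Lo", "Nl"})

# Hypothetical backwards-compatibility list: code points that an earlier
# version allowed at the start of an identifier but the current one does not.
ID_START_SUPPLEMENT = frozenset()


def is_id_start(cp: int, supplement=ID_START_SUPPLEMENT) -> bool:
    """Union of the property-derived set and the supplement list."""
    return unicodedata.category(chr(cp)) in ID_START_CATEGORIES or cp in supplement


print(is_id_start(ord("a")))                            # True
print(is_id_start(ord("1")))                            # False
print(is_id_start(ord("1"), supplement={ord("1")}))     # True, via the supplement
```

An ID_Continue_Supplement would be handled the same way, with its own category set and list.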
UAX #9 supersedes the text in Section 3.12, Bidirectional Behavior, in The Unicode Standard, Version 3.0. There are minor, non-normative textual revisions to the text of UAX #9 for Unicode 3.1.
In a corrigendum to UAX #15, U+FB1D YOD WITH HIRIQ has been added to the Composition Exclusion List. For more information, see UAX #15.
The following text amends portions of Chapter 5, Implementation Guidelines in The Unicode Standard, Version 3.0.
Section 5.2, ANSI/ISO C wchar_t, pages 107-108, the text is amended with the following additions and deletions.
With the wchar_t wide character type, ANSI/ISO C provides for the inclusion of fixed-width, wide characters. ANSI/ISO C leaves the semantics of the wide character set to the specific implementation but requires that the characters from the portable C execution set correspond to their wide character equivalents by zero extension. The Unicode characters in the ASCII range U+0020 to U+007E satisfy these conditions. Thus, if an implementation uses ASCII to code the portable C execution set, the use of the Unicode character set for the wchar_t type, in either UTF-16 or UTF-32 form, fulfills the requirement.
The width of wchar_t is compiler-specific and can be as little as 8 bits. Consequently, programs that need to be portable across any C or C++ compiler should not use wchar_t for storing Unicode text. The wchar_t type is intended for storing compiler-defined wide characters, which may be Unicode characters in some compilers. However, programmers who want a UTF-16 implementation can use a macro or typedef (for example, UNICHAR) that can be compiled as unsigned short or wchar_t depending on the target compiler and platform. Other programmers who want a UTF-32 implementation can use a macro or typedef which might be compiled as unsigned int or wchar_t, depending on the target compiler and platform. This choice enables correct compilation on different platforms and compilers. Where a 16-bit implementation of wchar_t is guaranteed, such macros or typedefs may be predefined (for example, WCHAR on Win32 API).
On systems where the native character type or wchar_t is implemented as a 32-bit quantity, an implementation may use the UTF-32 form to represent Unicode characters. The internal workings of this representation are treated as a black box and are not Unicode-conformant. In particular, any API or runtime library interfaces that accept strings of 32-bit characters are not Unicode-conformant. If such an implementation interchanges 16-bit Unicode characters with the outside world, then this interchange can be conformant as long as the interface for this interchange complies with the requirements of Chapter 3, Conformance.
A limitation of the ISO/ANSI C model is its assumption that characters can always be processed in isolation. Implementations that choose to go beyond the ISO/ANSI C model may find it useful to mix widths within their APIs. For example, an implementation may have a 32-bit wchar_t and process strings in any of UTF-8, UTF-16 or UTF-32 forms. Another implementation may have a 16-bit wchar_t and process strings as UTF-8 or UTF-16, but have additional APIs that process individual characters as UTF-32, or deal with pairs of UTF-16 code units.
Section 5.3, Unknown and Missing Characters: Unassigned and Private Use Character Codes, pages 108-109: add the following to the end of the subsection.
In practice, applications must deal with unassigned code points or unknown private use characters. This may occur, for example, when the application is handling text that originated on a system implementing a later release of Unicode, with additional assigned characters. To work properly in implementations, unassigned code points must be given default properties as if they were characters, since various algorithms require properties to be assigned to every character in order to function at all. These properties are not uniform across all unassigned code points, since certain ranges of code points need different properties to maximize compatibility.
Normally, code points outside the repertoire of supported characters would be displayed with a fall-back glyph, such as a black box. However, format and control characters must not have visible glyphs (although they may have an effect on other characters in display). These characters are also ignored except with respect to specific, defined processes: for example, ZERO WIDTH NON-JOINER is ignored in collation. To allow a greater degree of compatibility across versions of the standard, the ranges U+2060..U+206F, U+FFF0..U+FFFC, and U+E0000..U+E0FFF are reserved for format and control characters (General Category = Cf). Unassigned code points in these ranges should be ignored in processing and display.
The Unicode Bidirectional Algorithm assigns a Bidirectional Category to unassigned code points based on the expected direction of characters to be added in the future. For more information, see Bidirectional Character Types in Unicode Standard Annex #9: The Bidirectional Algorithm.
Unicode Standard Annex #14: Line Breaking Properties supplies the property "XX" for all unassigned code points in Definitions.
In determining character widths for East Asian display, Unicode Standard Annex #11: East Asian Width includes a section on Unassigned and Private Use characters.
In Unicode Standard Annex #15, Unicode Normalization Forms, unassigned code points are given the Canonical Combining Class = 0, and no decomposition mapping.
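The display guidance for the reserved format and control ranges can be sketched as a simple range test. In this Python illustration the function names are hypothetical, and U+25A0 BLACK SQUARE is an arbitrary stand-in for whatever fall-back glyph an implementation actually draws.

```python
# Ranges reserved for format and control characters, as listed above.
RESERVED_FORMAT_RANGES = (
    (0x2060, 0x206F),
    (0xFFF0, 0xFFFC),
    (0xE0000, 0xE0FFF),
)

FALLBACK_GLYPH = "\u25A0"  # assumption: stand-in for a "black box" glyph


def in_reserved_format_range(cp: int) -> bool:
    """True if the code point lies in a range reserved for format/control use."""
    return any(lo <= cp <= hi for lo, hi in RESERVED_FORMAT_RANGES)


def render_unsupported(cp: int) -> str:
    """What to draw for a code point outside the supported repertoire."""
    # Code points in the reserved ranges are ignored; others get a visible fallback.
    return "" if in_reserved_format_range(cp) else FALLBACK_GLYPH


print(repr(render_unsupported(0x2065)))  # ''
print(repr(render_unsupported(0x0378)))  # '■'
```

The caller is assumed to have already determined that the code point is unassigned or unsupported; the sketch does not consult the character database.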
Section 5.16, Identifiers: Specific Character Additions, page 134: the subsection name is changed to Specific Character Adjustments, and the following note is added:
Note: a useful set of characters to consider for exclusion from identifiers consists of all characters whose compatibility mappings have a <font> tag.
Section 5.11, Language Tagging in Plain Text, page 114: delete the following paragraph:
For interchange purposes, it is becoming common to use tagged information, which is embedded in the text. Unicode Technical Report #7, "Plane 14 Characters for Language Tags," which is found on the CD-ROM or in its up-to-date version on the Unicode Web site, provides a proposed mechanism for representing language tags. Like most tagging mechanisms, these language tags are stateful: a start tag establishes an attribute for the text, and an end tag concludes it.
The subsection Working with Language Tags, pages 114-115, has been moved to the newly created Section 13.7, Tag Characters, which is part of Article V, Block Descriptions. This is because its recommendations are specific to the tag characters described there.
Note: The numbering used here for block descriptions and revised text follows The Unicode Standard, Version 3.0 for ease of cross-reference.
Section 6.1, General Punctuation, Punctuation: U+0020-U+00BF, page 149: the following note is added:
Note: any of the characters U+002C, U+002E, U+060C, U+066B, or U+066C (and possibly others) can be used as numeric separator characters, depending on the locale and user customizations.
Section 6.1, General Punctuation, CJK Symbols and Punctuation: U+3000-U+303F, page 155: The first paragraph is updated as follows:
This block encodes punctuation marks and symbols used primarily by writing systems that employ Han ideographs. Some of the punctuation marks, in particular the brackets, are used in other typographic contexts as well. Most of these characters are found in East Asian standards.
Section 6.1 General Punctuation, CJK Symbols and Punctuation: U+3000-U+303F, page 155: add the following paragraph after the paragraph on "U+3006":
U+3008, U+3009 angle brackets have ambiguous width. They are wide in an East Asian context, but are narrow when used in other contexts, such as mathematics. There are other characters in this block that have the same characteristics, including double angle brackets, tortoise shell brackets, and white square brackets.
Note: The following text replaces the entire text of Section 7.5, Georgian, on page 173.
The Georgian script is used primarily for writing the Georgian language and its dialects. It is also used for the Svan and Mingrelian languages, and in the past was used for Abkhaz and other languages of the Caucasus.
Script Forms. The Georgian script originates from an inscriptional form called Asomtavruli, from which was derived a manuscript form called Nuskhuri. Together these forms are categorized as Khutsuri (ecclesiastical), but Khutsuri is not itself the name of a script form. Although no longer seen in most modern texts, the Nuskhuri style is still used for liturgical purposes. It was replaced, through a history now uncertain, by an alphabet called Mkhedruli (military), which is now the form used for nearly all modern Georgian writing.
Case Forms. The Georgian alphabet is fundamentally caseless, and is used as such in most texts. However, possibly owing to the influence of case forms in other alphabets, modern Georgian is occasionally written with uppercase capital letters. In this typographic departure, it is the Asomtavruli forms that serve to represent uppercase letters, while the lowercase is Mkhedruli or Nuskhuri. This usage parallels the evolution of the Latin alphabet,