Unicode Technical Report #8

The Unicode Standard®, Version 2.1

Revision 3.0
Authors Lisa Moore (lisam@us.ibm.com)
Date 1999-11-21
This Version http://www.unicode.org/unicode/reports/tr8/tr8-3
Previous Version http://www.unicode.org/unicode/reports/tr8/tr8-2
Latest Version http://www.unicode.org/unicode/reports/tr8

Summary

This report documents the Unicode Standard, Version 2.1.

Status of this document

This document contains informative material and normative specifications which have been considered and approved by the Unicode Technical Committee for publication as a Technical Report and as part of the Unicode Standard, Version 2.1. Any reference to version 2.1 of the Unicode Standard automatically includes this technical report. Please mail corrigenda and other comments to the author.

The content of all technical reports must be understood in the context of the appropriate version of the Unicode Standard. References in this technical report to sections of the Unicode Standard refer to the Unicode Standard, Version 2.0. See http://www.unicode.org/unicode/standard/versions for more information.

1 Description

Version 2.1 of the Unicode Standard brings together two additions to the repertoire which are expected to be in wide use in a number of implementations, errata collected since the publication of Version 2.0, and a number of updates to the character properties database. The two newly added characters are the U+FFFC OBJECT REPLACEMENT CHARACTER and the U+20AC EURO SIGN. The object replacement character is already employed in multiple implementations, and the euro sign is expected to be widely used very soon as the European Monetary Union (EMU) proceeds to phase in its use as the EMU unit of currency. This modification of the Unicode Standard is made available so that implementers can proceed with their support plans knowing that their implementation of Unicode is a well-defined, conforming version. With the additions of Version 2.1, the Unicode Standard contains 38,887 characters from the world’s scripts.

Additional characters and scripts have been accepted into the Unicode Standard since the publication of The Unicode Standard, Version 2.0. These are not included in Version 2.1 but are documented on the Unicode Web site at: http://www.unicode.org/unicode/alloc/Pipeline.html

1.1 Conformance

Overall Unicode conformance criteria as described in Chapter 3 of Version 2.0 are unchanged. Specific aspects of the bidirectional algorithm have been modified in Version 2.1, Hangul syllable decompositions have been clarified, and certain normative character property values have been changed.

2 Object Replacement Character

The U+FFFC OBJECT REPLACEMENT CHARACTER is used as an insertion point for objects located within a stream of text. All other information about the object is kept outside the character data stream. Internally it is a dummy character which acts as an anchor point for the object’s formatting information. In addition to assuring correct placement of an object in a data stream, the object replacement character also allows the use of general stream-based algorithms for any textual aspects of embedded objects.

The object replacement character is classified as a Symbol, Other (So) and has a bidirectional category of Other Neutrals (ON).
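
As an illustration of the anchor-point idea, the following is a minimal sketch and not part of the standard. It assumes a hypothetical side table in which the n-th U+FFFC in the text is paired with the n-th object record; the names EmbeddedObject and object_at and the fields of the record are invented for this example.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define OBJECT_REPLACEMENT 0xFFFCu  /* U+FFFC OBJECT REPLACEMENT CHARACTER */

/* Hypothetical out-of-band record: everything about the object except its
   position in the text lives here, outside the character data stream. */
typedef struct {
    const char *kind;    /* illustrative only, e.g. "image" */
    int width, height;   /* illustrative formatting information */
} EmbeddedObject;

/* Return the object anchored at text[pos], or NULL if text[pos] is not
   U+FFFC or no record exists for it. Assumption of this sketch: the n-th
   U+FFFC in the text is paired with objects[n]. */
const EmbeddedObject *object_at(const uint16_t *text, size_t len, size_t pos,
                                const EmbeddedObject *objects, size_t nobjects)
{
    size_t i, n = 0;
    assert(text != NULL || len == 0);
    if (pos >= len || text[pos] != OBJECT_REPLACEMENT)
        return NULL;
    for (i = 0; i < pos; i++)          /* count the anchors before pos */
        if (text[i] == OBJECT_REPLACEMENT)
            n++;
    return n < nobjects ? &objects[n] : NULL;
}
```

Because the text itself holds only the one-character anchor, ordinary stream-based operations such as searching or counting code values can run over it without knowing anything about the objects.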

Addition

p 7-523. Add to the standard the following character:

FFFC OBJECT REPLACEMENT CHARACTER

3 Euro Sign

The new single currency for member countries of the European Monetary Union (EMU) is the euro. The euro character is encoded in the Unicode Standard as U+20AC EURO SIGN.

To avoid confusion, the historical character U+20A0 EURO-CURRENCY SIGN has been updated with an informative note and a cross reference to U+20AC EURO SIGN.

The euro character is classified as Symbol, Currency (Sc) and has a bidirectional category of European Number Terminator (ET).

Corrigendum

p 7-161. Currency symbols character names list

Add the following informative note for character 20A0:

"Historical character derived from Xerox Character Code Standard"

Add the following cross reference for character 20A0:

"20AC euro sign"

Addition

p 7-161. Add to the standard the following character:

20AC EURO SIGN

Add the following informative note for 20AC:

"Currency sign for the European Monetary Union"

Add the following cross reference for 20AC:

"20A0 euro-currency sign"

4 Errata

4.1 Math Property Characters

Additional Unicode characters have been designated as having the mathematical property. Typos in the Version 2.0 list of characters with the mathematical property have also been corrected.

Corrigenda

p 4-25. In the list following section 4.9

Change 20A6 to 2016.

Change "20D2..20E1" to "20D0..20DC, 20E1".

Add the following characters to the list:

207A..207E SUPERSCRIPT PLUS SIGN..SUPERSCRIPT RIGHT PARENTHESIS
208A..208E SUBSCRIPT PLUS SIGN..SUBSCRIPT RIGHT PARENTHESIS
FB29 HEBREW LETTER ALTERNATIVE PLUS SIGN
FE35..FE38 PRESENTATION FORM FOR VERTICAL LEFT PARENTHESIS..PRESENTATION FORM FOR VERTICAL RIGHT CURLY BRACKET
FE59..FE5C SMALL LEFT PARENTHESIS..SMALL RIGHT CURLY BRACKET
FE61..FE66 SMALL ASTERISK..SMALL EQUALS SIGN
FE68 SMALL REVERSE SOLIDUS
FF08..FF0B FULLWIDTH LEFT PARENTHESIS..FULLWIDTH PLUS SIGN
FF0D FULLWIDTH HYPHEN-MINUS
FF0F FULLWIDTH SOLIDUS
FF1C..FF1E FULLWIDTH LESS-THAN SIGN..FULLWIDTH GREATER-THAN SIGN
FF3B..FF3E FULLWIDTH LEFT SQUARE BRACKET..FULLWIDTH CIRCUMFLEX ACCENT
FF5B..FF5E FULLWIDTH LEFT CURLY BRACKET..FULLWIDTH TILDE
FFE2 FULLWIDTH NOT SIGN
FFE8..FFEC HALFWIDTH FORMS LIGHT VERTICAL..HALFWIDTH DOWNWARDS ARROW

4.2 Letter Errata

Two characters have been removed from the alphabetics listing, U+02BC MODIFIER LETTER APOSTROPHE and U+055A ARMENIAN APOSTROPHE.

Corrigendum

p 4-14. Section 4.5 Letters

Remove 02BC and 055A from the table of alphabetics.

4.3 Canonical Decomposition Clarification

The status of Hangul syllable decompositions has been clarified.

Corrigenda

p 3-7. D23

Change the first sentence to read: "canonical decomposition: the decomposition of a character which results from recursively applying the canonical mappings found in the names list of Section 7.1, Character Names List Entries and those described in Section 3.10 Combining Jamo Behavior until no characters can be further decomposed, and then reordering non-spacing marks according to Section 3.9, Canonical Ordering Behavior."

p 3-11. Section 3.10 Combining Jamo Behavior

Change the third bullet to: "determine the canonical decomposition of Hangul syllables"

p 3-13. Item 1

Change the first sentence to: "Process C by composing the conjoining jamo wherever possible, according to the compatibility decomposition rules in Chapter 7, Code Charts."

Change the fourth sentence to: "Raw keyboard data, on the other hand, may be in the form of a compatibility decomposition."

p 3-13. Hangul Syllable Decomposition

Change the first sentence to: "The following describes the reverse mapping - how to take Hangul syllable S and derive the canonical decomposition C."
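
The reverse mapping from a Hangul syllable S to its canonical decomposition is purely arithmetic. The following is a minimal sketch using the base values and counts of the Hangul syllable arithmetic in the Unicode Standard; the function name hangul_decompose and its calling convention are assumptions made for this example.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Base values and counts for the Hangul syllable arithmetic. */
enum {
    S_BASE = 0xAC00, L_BASE = 0x1100, V_BASE = 0x1161, T_BASE = 0x11A7,
    L_COUNT = 19, V_COUNT = 21, T_COUNT = 28,
    N_COUNT = V_COUNT * T_COUNT,    /* 588 */
    S_COUNT = L_COUNT * N_COUNT     /* 11172 */
};

/* Write the canonical decomposition of Hangul syllable s into out and
   return the number of jamo written (2 or 3). Returns 0, leaving out
   untouched, if s is not a precomposed Hangul syllable. */
int hangul_decompose(uint16_t s, uint16_t out[3])
{
    int s_index, t;
    assert(out != NULL);
    if (s < S_BASE || s >= S_BASE + S_COUNT)
        return 0;
    s_index = s - S_BASE;
    out[0] = (uint16_t)(L_BASE + s_index / N_COUNT);              /* leading consonant */
    out[1] = (uint16_t)(V_BASE + (s_index % N_COUNT) / T_COUNT);  /* vowel */
    t = s_index % T_COUNT;
    if (t == 0)
        return 2;                       /* no trailing consonant */
    out[2] = (uint16_t)(T_BASE + t);    /* trailing consonant */
    return 3;
}
```

For example, U+D55C has syllable index 10588, which yields the jamo sequence U+1112, U+1161, U+11AB.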

4.4 Identifier Errata

New distinctions have been made in the Unicode Character Database for use in identifiers. In addition, changes have been made to the text of the standard.

Corrigenda

p. 5-26, 27. Section 5.14 Identifiers

Add 06DD and 06DE to <enclosing_char>.

Add compatibility low lines FE33, FE34, FE4D..FE4F to <underscore>.

Remove 0387 from <extender>.

Remove <identifier_part> and its definition.

Change the <identifier> syntactic rule to:

"<identifier> ::= <identifier_start> ( <identifier_start> | <identifier_extend> )*"

Add the following syntactic rules at the end of the list:

"<identifier_extend> ::= <decimal_digit_char> | <ident_combining_char> | <underscore> | <extender> | <ident_ignorable_char> | <connector>
<connector> ::= { 203F, 2040 }"

Following the syntactic rules add the following:

"Identifiers are ultimately defined by a set of character categories from the Unicode Character Database. (The individual Terminal Classes described in the text do not have a one-to-one relationship with the character categories, but the resulting definitions of identifiers are intended to be the same.)

Syntactic Class          Equivalent Category Set   Coverage
<identifier_start>       Lu, Ll, Lt, Lm, Lo, Nl    Uppercase letter, Lowercase letter, Titlecase letter, Modifier letter, Other letter, Letter number
<identifier_extend>      Mn, Mc, Nd, Pc, Cf        Non-spacing mark, Spacing combining mark, Decimal number, Connector punctuation, Formatting code
<ident_ignorable_char>   Cf                        Formatting code

For an explicit list of the current coverage of each of these syntactic classes, see <identifier_start>, <identifier_extend>, and <ident_ignorable_char>."
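
The category-based definition can be sketched as a small check. This is an illustration only: it assumes a caller-supplied function that returns the two-letter general category of a code value (the type CategoryFn is hypothetical), and it ships a toy ASCII-only lookup, ascii_category, purely for demonstration. A real implementation would consult the Unicode Character Database.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical caller-supplied lookup: returns the two-letter general
   category of a code value, for example "Lu" or "Nd". */
typedef const char *(*CategoryFn)(uint16_t c);

/* set is a comma-separated list such as "Lu,Ll". Categories are always
   two letters, so a substring match cannot straddle two entries. */
static int in_set(const char *cat, const char *set)
{
    return cat != NULL && strlen(cat) == 2 && strstr(set, cat) != NULL;
}

/* <identifier> ::= <identifier_start> ( <identifier_start> | <identifier_extend> )* */
int is_identifier(const uint16_t *s, size_t len, CategoryFn category)
{
    static const char START[]  = "Lu,Ll,Lt,Lm,Lo,Nl";
    static const char EXTEND[] = "Mn,Mc,Nd,Pc,Cf";
    size_t i;
    assert(category != NULL);
    if (len == 0 || !in_set(category(s[0]), START))
        return 0;
    for (i = 1; i < len; i++) {
        const char *cat = category(s[i]);
        if (!in_set(cat, START) && !in_set(cat, EXTEND))
            return 0;
    }
    return 1;
}

/* Toy lookup covering a few ASCII ranges only (an assumption of this
   sketch, not real database content). */
const char *ascii_category(uint16_t c)
{
    if (c >= 'A' && c <= 'Z') return "Lu";
    if (c >= 'a' && c <= 'z') return "Ll";
    if (c >= '0' && c <= '9') return "Nd";
    if (c == '_') return "Pc";
    return "Xx";   /* placeholder meaning "anything else"; not a real category */
}
```

With the toy lookup, the sequence x _ 1 is accepted, while a sequence that starts with a digit or contains a hyphen-minus is rejected.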

4.5 Bidirectional Behavior Errata

Since the publication of the Unicode Standard, Version 2.0, many aspects of the bidirectional behavior algorithm have been clarified or modified, including the basic display algorithm, bidirectional character types, base levels, resolving weak and neutral types, and resolving implicit levels. These changes affect pages 3-14 through 3-23 of the standard. Additionally, a few characters have been assigned new bidirectional type properties.

4.5.1 Basic Display Algorithm

The description of the scope of the algorithm within a block has been clarified, and a pointer to further information on the handling of CR and LF has been added.

Corrigendum

p 3-16. At the end of the paragraph before the first bullet, add:

"The algorithm only reorders text within a block; characters on one side of a block separator have no effect on characters on the other side. (Also, see Section 4.3, Directionality on the handling of CR, LF, and CRLF)"

4.5.2 Bidirectional Character Types

The following (together with a change to Reordering Resolved Levels) clarifies how to implement the last paragraph of page 3-16.

Corrigenda

p 3-17. Before Table 3-5, add:

"Combining marks are given the type of the preceding letter."

p 4-11. After "where there are gaps.", add:

"Combining marks are given the type of the preceding letter, and are not called out in this table either."

4.5.3 The Base Level

Several of the rules were corrected to say embedding direction rather than global direction. The first term is more explicitly defined.

Corrigendum

p 3-18. Before "Explicit Levels and Directions", insert:

"The direction of the current embedding level (for a character in question) is called the embedding direction. It is L if the embedding level is even, and R if the embedding level is odd."

4.5.4 Terminating Embeddings and Overrides

T6 incorrectly removed implicit and explicit directional formatting codes. The original purpose of T6 was to allow the use of styles or style sheets instead of embedding or override codes (see p. 3-22). T6 has been eliminated, and N4 has been changed instead (see below).

Corrigendum

p 3-19. T6

Delete T6.

4.5.5 Resolving Weak Types

P1 has been clarified to state that it applies to single characters, and P2 more explicitly shows how to resolve a sequence of European terminators.

Corrigenda

p 3-19. P1

Change to "P1. A single European separator between two European numbers changes to a European number. A single common separator between two numbers of the same type changes to that type."

p 3-19. P2

Change to "P2. A sequence of European terminators adjacent to European numbers changes to all European numbers.

ET, ET, EN → EN, EN, EN

EN, ET, ET → EN, EN, EN

AN, ET, EN → AN, EN, EN"

p 3-19. P3

Add example at end: "ET, AN → N, AN"
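
The revised P1 and P2 can be sketched as two passes over an array of bidirectional types. This is a minimal illustration, not the reference implementation: the enum BidiType and the function resolve_weak_p1_p2 are names invented here, only the types these two rules need are modeled, and P3 is not applied.

```c
#include <assert.h>
#include <stddef.h>

/* Only the bidirectional types needed for rules P1 and P2. */
typedef enum { EN, AN, ES, CS, ET, OTHER } BidiType;

/* Apply P1, then P2, in place to types[0..n-1]. */
void resolve_weak_p1_p2(BidiType *types, size_t n)
{
    size_t i, j, k;
    assert(types != NULL || n == 0);

    /* P1: a single separator between two numbers. */
    for (i = 1; i + 1 < n; i++) {
        BidiType prev = types[i - 1], next = types[i + 1];
        if (types[i] == ES && prev == EN && next == EN)
            types[i] = EN;
        else if (types[i] == CS && prev == next && (prev == EN || prev == AN))
            types[i] = prev;
    }

    /* P2: a run of ET adjacent to EN on either side becomes EN. */
    for (i = 0; i < n; i = j) {
        j = i + 1;
        if (types[i] != ET)
            continue;
        while (j < n && types[j] == ET)
            j++;                          /* the run is types[i..j-1] */
        if ((i > 0 && types[i - 1] == EN) || (j < n && types[j] == EN))
            for (k = i; k < j; k++)
                types[k] = EN;
    }
}
```

Running it on the three P2 examples gives EN, EN, EN for the first two and AN, EN, EN for the third. Two adjacent separators between numbers are left unchanged, since P1 applies only to a single separator.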

4.5.6 Resolving Neutral Types (1)

The wording in N2 has been modified to use the embedding direction instead of the global direction, and the confusing term "letter" has been changed to "character", which makes it clear that strong R punctuation should be included.

Corrigenda

p 3-19. N2

Replace "global" by "embedding".

p 3-20. N3

Change "letter" to "character" everywhere.

4.5.7 Resolving Neutral Types (2)

Since N4 describes the behavior of embedding codes, it has been moved to a more appropriate place in the algorithm. It replaces T6 and now describes the behavior of override codes as well.

Corrigenda

p 3-19, 20. Move N4 to where T6 was. Change the number to T6, and change the wording and examples to:

"T6. In the following rules, an embedding or override code and its matching PDF act as if they were strong characters of the appropriate type. All unmatched PDFs are ignored. If two embeddings with the same level are adjacent, then the PDF terminating the first embedding and the code initiating the next embedding are ignored.

LRO ... PDF → L ... L

LRE ... PDF → L ... L

RLO ... PDF → R ... R

RLE ... PDF → R ... R

RLE ... PDF, RLO ... PDF → RLE ..., ... PDF"

4.5.8 Resolving Implicit Levels

I1 and I2 have been modified to ensure that implementers will use the embedding direction instead of the base direction. Also, although Table 3-7 refers to Sequence Type, the wording was not clear that the rules applied to sequences. This is important in the case of EN.

Corrigenda

p 3-20, 21. I1

Replace "global" by "embedding".

Replace "Numeric text (EN) goes up two levels unless preceded by left-to-right text." by: 

"A sequence of one or more numeric types (EN) goes up two levels unless immediately preceded by left-to-right text."

Change the example from "(L) EN" to "(L) EN...EN"

4.5.9 Reordering Resolved Levels

L1 incorrectly implied that there could be more than one block separator. This has been corrected and more explanation is provided.

Corrigenda 

p 3-20. L1

Add to the end of the paragraph before L1: 

"The process of breaking a paragraph into one or more lines that fit within particular bounds is outside the scope of the bidirectional algorithm. Where character shaping is involved, it can be somewhat more complicated (see pages 6-22 through 6-32). Logically there are the following steps: 

  • The levels of the text are determined according to the bidi algorithm.

  • The characters are shaped into glyphs according to their context (taking the embedding levels into account). 

  • The accumulated widths of those glyphs (in logical order) are used to determine line breaks.

  • The glyphs on each line are then separately reordered according to the rules L1 and L2 below." 

Change in L1, "trailing white space (including block separators)" to "any trailing white space characters (including those of type B, S, and WS)". 

Add after L1, "(Note: since a Block separator breaks lines, there will be at most one per line.)" 

Before "Bidirectional Conformance", add:

"Combining marks applied to a right-to-left base character will at this point precede their base character. See Section 5.12 Rendering Non-Spacing Marks for an illustration of this. If the rendering engine expects them to follow the base characters in the final display process, then the ordering of the marks and the base character will need to be reversed."

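The reversal of marks and base characters described above can be sketched as follows. This is an illustration under stated assumptions: the run is already in visual order, a caller-supplied predicate (the hypothetical type IsMarkFn) identifies combining marks, and the toy predicate is_hebrew_point covers only U+05B0..U+05BD for demonstration.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical caller-supplied predicate: nonzero if c is a combining mark. */
typedef int (*IsMarkFn)(uint16_t c);

/* Reverse s[lo..hi-1] in place. */
static void reverse(uint16_t *s, size_t lo, size_t hi)
{
    while (lo + 1 < hi) {
        uint16_t t = s[lo];
        s[lo++] = s[--hi];
        s[hi] = t;
    }
}

/* In a right-to-left run that is already in visual order, each base
   character is preceded by its marks. Reverse every marks-plus-base
   cluster so that the marks follow their base again. Marks at the end
   of the run with no base after them are left untouched. */
void marks_after_base(uint16_t *run, size_t n, IsMarkFn is_mark)
{
    size_t start = 0, i;
    assert(is_mark != NULL);
    for (i = 0; i < n; i++) {
        if (!is_mark(run[i])) {           /* a base closes the cluster start..i */
            reverse(run, start, i + 1);
            start = i + 1;
        }
    }
}

/* Toy predicate for demonstration only: Hebrew points U+05B0..U+05BD. */
int is_hebrew_point(uint16_t c)
{
    return c >= 0x05B0 && c <= 0x05BD;
}
```

For the visually ordered run 05B7 05BC 05D1 05B8 05D0, the result is 05D1 05BC 05B7 05D0 05B8: the order of the base characters is unchanged, and each base is now followed by its marks.
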
4.5.10 Characters with New Directional Properties

Certain characters have new bidirectional property definitions. To improve the display of e-mail addresses and URLs, the directional types of U+0026 AMPERSAND and U+0040 COMMERCIAL AT have been changed from left-to-right to other neutral. The directional type of U+002E FULL STOP has been changed from EUROPEAN NUMBER SEPARATOR to COMMON NUMBER SEPARATOR to improve the display of decimal numbers; U+2007 FIGURE SPACE has also been changed from EUROPEAN NUMBER SEPARATOR to COMMON NUMBER SEPARATOR for consistency.

Corrigenda

p 4-11. Table 4.4 Bidirectional Character Types

Remove the table entry "Miscellaneous U+0026, U+0040" from the strong left-to-right category.

Remove the table entries "Full Stop (Period) U+002E" and "Figure Space U+2007" from the European Number Separator category.

p 4-12. Table 4.4 Bidirectional Character Types

Add the table entries "Full Stop (Period) U+002E" and "Figure Space U+2007" to the Common Number Separator category.

4.6 Apostrophe Semantics Errata

The following corrigenda clarify the semantics of different apostrophes, and correct problems in the mapping tables from Windows and Macintosh code pages.

Corrigendum

p 6-3. Add at the end of Loose versus Precise Semantics:

"For historical reasons, U+0027 is a particularly overloaded character. In ASCII it is used to represent a punctuation mark (such as right single quotation mark, left single quotation mark, apostrophe punctuation, vertical line, or prime) or a modifier letter (such as apostrophe modifier or acute accent.) (Punctuation marks generally break words; modifier letters generally are considered part of a word.) In many systems it is always represented as a straight vertical line and can never represent a curly apostrophe or right quotation mark.

In the case of an apostrophe,

  • U+02BC MODIFIER LETTER APOSTROPHE is preferred where the character is to represent a modifier letter (for example, in transliterations to indicate a glottal stop.) In the latter case, it is also referred to as a letter apostrophe.
  • U+2019 RIGHT SINGLE QUOTATION MARK is preferred where the character is to represent a punctuation mark, as in "We’ve been here before." In the latter case, U+2019 is also referred to as a punctuation apostrophe.

In implementation, however, you cannot assume that users’ text always adheres to the distinction between these characters. The text may come from different sources, including mapping from other character sets that do not have this distinction between letter apostrophe and punctuation apostrophe/right single quotation mark. In that case, all of them will generally be represented by U+2019.

Where you are parsing text where such distinctions are important, you will still need to look at the context around the characters to help disambiguate the relevant semantics."

 

Corrigendum

p 7-7. Change character 0027 informative notes, second bullet to:

"preferred character for apostrophe is either 02BC MODIFIER LETTER APOSTROPHE or 2019 RIGHT SINGLE QUOTATION MARK (which also represents a punctuation apostrophe)."

 

Corrigendum

p 7-37. Change character 02BC informative notes, third bullet to:

"this is the preferred character for letter apostrophe."

 

Corrigendum

p 7-155. Change character 2019 informative notes, first bullet to:

"this is the preferred character for quotation mark and punctuation apostrophe."

4.7 Typographic Errata

The following are typographic errors in the text of the standard.

Corrigenda

pp 7-50..7-55. Change the page header to "0400...Cyrillic...04FF".

pp 7-66..7-70. Change the page header to "0600...Arabic...06FF".

4.8 Glyph Errata

A number of glyphs have been corrected. The corrections are given here and can be found on the Unicode Web site at:

http://www.unicode.org/unicode/uni2errata/UnicodeTypos.html

Additional glyph corrections will be posted to this site as available.

Corrigenda

05F1 HEBREW LIGATURE YIDDISH VAV YOD
2603 SNOWMAN
3085 HIRAGANA LETTER SMALL YU
FA0E CJK Compatibility Ideograph
FA0F CJK Compatibility Ideograph
FA10 CJK Compatibility Ideograph
FA11 CJK Compatibility Ideograph
FA12 CJK Compatibility Ideograph
FA13 CJK Compatibility Ideograph
FA14 CJK Compatibility Ideograph
FA15 CJK Compatibility Ideograph
FA16 CJK Compatibility Ideograph
FA17 CJK Compatibility Ideograph
FA18 CJK Compatibility Ideograph
FA19 CJK Compatibility Ideograph
FA1A CJK Compatibility Ideograph
FA1B CJK Compatibility Ideograph
FA1C CJK Compatibility Ideograph
FA1D CJK Compatibility Ideograph
FA1E CJK Compatibility Ideograph
FA1F CJK Compatibility Ideograph
FA20 CJK Compatibility Ideograph
FA21 CJK Compatibility Ideograph
FA22 CJK Compatibility Ideograph
FA23 CJK Compatibility Ideograph
FA24 CJK Compatibility Ideograph
FA25 CJK Compatibility Ideograph
FA26 CJK Compatibility Ideograph
FA27 CJK Compatibility Ideograph
FA28 CJK Compatibility Ideograph
FA29 CJK Compatibility Ideograph
FA2A CJK Compatibility Ideograph
FA2B CJK Compatibility Ideograph
FA2C CJK Compatibility Ideograph
FA2D CJK Compatibility Ideograph

4.9 UTF-7 Sample Code Correction

The UTF-7 specification was unclear on one point, which led to an error in the sample code for converting from UCS-2 to UTF-7. The problem occurs when U+002D HYPHEN-MINUS follows a character that must be encoded. Because ASCII 0x2D is the terminating character for an encoded sequence, two 0x2D characters must be output in order to preserve the U+002D when converting back to Unicode.

RFC 2152 has been published with a revised version of the UTF-7 specification. The file included on the CD-ROM has been updated with this fix.
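
To make the hyphen-minus case concrete, here is a minimal, simplified encoder sketch. It is not the sample code from the standard or from RFC 2152. As a loudly stated simplification, it writes only ASCII letters, digits, space, "/" and "-" directly, writes "+" as "+-", and base64-encodes everything else. The names utf7_encode and is_direct are invented for this example.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

static const char B64[] =
    "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789+/";

/* Simplification of this sketch: only ASCII letters, digits, '/', space
   and '-' are written directly. */
static int is_direct(uint16_t c)
{
    return c < 0x80 && c != 0 && c != '+' &&
           (c == ' ' || c == '-' || strchr(B64, (int)c) != NULL);
}

/* Encode n UCS-2 code values as a NUL-terminated UTF-7 string in dst.
   dst must hold at least 8 * n + 1 bytes. Returns the length written. */
size_t utf7_encode(const uint16_t *src, size_t n, char *dst)
{
    size_t i, len = 0;
    uint32_t bits = 0;   /* pending bits in the low end */
    int nbits = 0;       /* number of pending bits */
    int shifted = 0;     /* inside a '+' ... '-' base64 run? */

    assert(dst != NULL && (src != NULL || n == 0));
    for (i = 0; i < n; i++) {
        uint16_t c = src[i];
        if (is_direct(c) || c == '+') {
            if (shifted) {
                if (nbits > 0)   /* flush leftover bits, zero-padded */
                    dst[len++] = B64[(bits << (6 - nbits)) & 0x3F];
                /* The explicit '-' is required before a base64 character
                   and before '-' itself; otherwise it may be omitted. */
                if (c == '-' || strchr(B64, (int)c) != NULL)
                    dst[len++] = '-';
                bits = 0; nbits = 0; shifted = 0;
            }
            dst[len++] = (char)c;
            if (c == '+')
                dst[len++] = '-';   /* "+-" stands for a literal '+' */
        } else {
            if (!shifted) {
                dst[len++] = '+';
                shifted = 1;
            }
            bits = (bits << 16) | c;
            nbits += 16;
            while (nbits >= 6) {
                nbits -= 6;
                dst[len++] = B64[(bits >> nbits) & 0x3F];
            }
        }
    }
    if (shifted) {
        if (nbits > 0)
            dst[len++] = B64[(bits << (6 - nbits)) & 0x3F];
        dst[len++] = '-';
    }
    dst[len] = '\0';
    return len;
}
```

Encoding U+20AC followed by U+002D yields "+IKw--": the first "-" terminates the encoded sequence and the second is the hyphen-minus itself. Without the extra "-", a decoder would read "+IKw-" as the euro sign alone and the hyphen-minus would be lost.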

Corrigenda

p A-5. The correction is in the code near the bottom of the page. The new text is condition 3 in the comment and the added test "r == SHIFT_OUT".

if (!needshift)
{
    /* Write the explicit shift out character if
        1) The caller has requested that we always do it, or
        2) The directly encoded character is in the
           base64 set, or
        3) The directly encoded character is SHIFT_OUT.
    */
    if (verbose || ((!done) && (invbase64[r] >= 0
        || r == SHIFT_OUT)))
    {
        TARGETCHECK;
        *target++ = SHIFT_OUT;
    }
    shifted = 0;
}

5 Unicode Character Database and Properties Changes

In addition to including the properties for the object replacement character and the euro sign, the Unicode Technical Committee has approved changes to the Unicode Character Database to reconcile problems found in an analysis of the character categories, and to make new distinctions in the database for use in identifiers. The property changes reflect the following:

  1. Encoding of U+20AC EURO SIGN and U+FFFC OBJECT REPLACEMENT CHARACTER

  2. Removing space, white space and delimitation as characteristics of U+FEFF

  3. Narrowing the concept of white space to exclude miscellaneous ignorable Unicode controls and the Unicode NULL

  4. Mandated changes in directional properties, expanded to compatibility forms for consistency

The details are given in the following table:

Space                   Remove   FEFF
White space             Remove   0000, 200C..200F, 202A..202E, 206A..206F, FEFF
Punctuation             Add      00B7
Delimiter               Remove   FEFF
Currency Symbol         Add      20AC
Bidi: Left-to-Right     Remove   0026, 0040, FE60, FE6B, FF06, FF20
Bidi: Eur Num Term      Add      20AC
Bidi: Eur Num Sep       Remove   002E, 2007, FE52, FF0E
Bidi: Common Sep        Add      002E, 2007, FE52, FF0E
Bidi: Other Neutrals    Add      0026, 0040, FE60, FE6B, FF06, FF20
Unassigned Code Value   Remove   20AC, FFFC

This new information is reflected in the newest version of the Unicode Character Database and the additional properties files in the Unicode 2.1 Update directory on the unicode.org ftp site:

ftp://ftp.unicode.org/Public/2.1-Update/

The 2.1 files in the update directory supersede the three 2.0 files on the CD-ROM distributed with The Unicode Standard, Version 2.0, which are also available at:

ftp://ftp.unicode.org/Public/UNIDATA

A diff file cataloging the changes in the Unicode Character Database file is also available:

ftp://ftp.unicode.org/Public/2.1-Update/diff2014v211.txt

Revisions

Changes for Revision 3

Formatting corrections were made.

Changes for Revision 2

Correction of typographical and glyph errors as follows:

1. Typo in section 3.4 Identifier Errata, third line describing compatibility low lines corrected to read FE33, not FF33.

2. Glyph for U+FA0E in section 3.8 corrected.

3. Under 3.4 Identifier Errata, in the small unlined table towards the bottom, under "Coverage," second entry, changed "Enclosing mark" to "Spacing combining mark."

Internal hyperlinks added at beginning of document.

Changes for Revision 1

Correction of two typographical errors as follows:

1. In section 3.9 "UTF-7 Sample Code Correction", in the sentence "The problem occurs when U+200D HYPHEN-MINUS follows a character that must be encoded.", "U+200D" corrected to read "U+002D".

2. In section 3.6, the third corrigendum, "p 7-37. Change character 02BC informative notes, first bullet to:", "first" corrected to read "third".


Copyright © 1998-1999 Unicode, Inc. All Rights Reserved. The Unicode Consortium makes no expressed or implied warranty of any kind, and assumes no liability for errors or omissions. No liability is assumed for incidental and consequential damages in connection with or arising out of the use of the information or programs contained or accompanying this technical report.

Unicode and the Unicode logo are trademarks of Unicode, Inc., and are registered in some jurisdictions.