Unicode 3.0.1
Version 3.0.1 has been superseded by the
latest version
of the Unicode Standard.
Version 3.0.1 of the Unicode Standard consists of the core specification,
The Unicode Standard
Version 3.0, as modified by the textual updates specified on
this page, the code charts for this version (currently only available in hard copy),
the Unicode Standard Annexes,
and the 3.0.1 Update of the Unicode Character Database (UCD).
A complete specification of the contributory files for Unicode 3.0.1
is found on the page
Components for 3.0.1. That page also provides the recommended
reference format for this version of the Unicode Standard.
Contents of This Document
Unicode 3.0.1 is an
update version.
It does not contain character additions or major normative changes. See also
Corrigendum #1
for an important corrigendum applicable to Unicode 3.0.1, which modifies
the conformance requirements for UTF-8.
There are two new categories of approved Unicode Technical
Reports that have been given more authoritative status by the
Unicode Technical Committee: Unicode Standard Annex (UAX)
and Unicode Technical Standard (UTS).
For more information, see
About Unicode
Technical Reports.
Several of
the Unicode Standard Annexes have also been updated in this version of Unicode. Of
particular interest is the conformance test for normalization.
Three new data files have been added to the Unicode 3.0.1
release:
BidiMirroring.txt (UAX #9: The
Bidirectional Algorithm)
- Informative properties for substituting characters in an
implementation of bidirectional mirroring.
CaseFolding.txt (UTR #21: Case Mappings)
- Informative file mapping characters to their case-folded
form.
NormalizationTest.txt (UAX #15: Unicode
Normalization Forms)
- Normative test file for conformance to Unicode Normalization
Forms; a sketch of reading this file follows the list.
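As a rough illustration of how such a data file might be consumed, the sketch below reads NormalizationTest.txt. It assumes the layout used by the file's own header comments: "#" begins a comment, "@" begins a part header, and each test line holds five semicolon-separated fields (c1..c5), each a space-separated list of hexadecimal code point values. The function name and structure are illustrative, not part of the standard.

```python
# Minimal sketch of reading NormalizationTest.txt (assumed layout; see above).
def parse_normalization_test(path):
    tests = []
    with open(path, encoding="utf-8") as f:
        for line in f:
            line = line.split("#", 1)[0].strip()   # strip trailing comments
            if not line or line.startswith("@"):   # skip blanks and part headers
                continue
            fields = line.split(";")[:5]           # c1..c5
            # Convert "0044 0307" into the string "D\u0307", and so on.
            tests.append(["".join(chr(int(cp, 16)) for cp in field.split())
                          for field in fields])
    return tests
```

Each parsed row can then be checked against the invariants stated in the file's own header (for example, that column c2 is the canonical composition of column c1).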
In each new version of the Unicode Standard, the Unicode
Consortium may add characters or make certain changes in characters that were
encoded in a previous version of the standard. To minimize the
impact on existing implementations, however, there are limitations
imposed by the consortium on the types of changes that can be made.
See Unicode
Character Encoding Stability Policy for more information.
The following describes the textual updates that
have been made. All references to sections and page numbers
are to The Unicode
Standard, Version 3.0. For detailed changes,
struck-through indicates deleted text; underline
indicates added text.
Section 0.2, Notational Conventions, page xxviii:
change the description of the U+ notation to read:
In running text, an individual Unicode code point can be
expressed as U+n, where n is from four to six
hexadecimal digits, using the digits 0-9 and A-F (for 10 through
15, respectively). There should be no leading zeros, unless the
code point would have fewer than four hexadecimal digits; for
example, U+0001, U+0012, U+0123, U+1234, U+12345, U+102345.
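For illustration, the convention above can be reproduced with a small formatting helper; the helper name below is ours, not part of the standard.

```python
def u_notation(cp: int) -> str:
    # At least four hexadecimal digits, upper case, with leading zeros
    # only when the code point would otherwise have fewer than four digits.
    return f"U+{cp:04X}"

# The examples given in the text above:
assert u_notation(0x1) == "U+0001"
assert u_notation(0x123) == "U+0123"
assert u_notation(0x12345) == "U+12345"
assert u_notation(0x102345) == "U+102345"
```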
There are 34 distinguished code points in Unicode that are
characterized as noncharacters. To clarify the status of all
34, a definition (page 41) is added, and conformance rules C5 and
C10 (pages 38, 39) are amended as follows:
D7b    Noncharacter: a code point that is permanently reserved for
internal use, and that should never be interchanged. In Unicode 3.0,
these consist of the values U+nFFFE and U+nFFFF, where n is from 0
to 10₁₆.
- For more information, see the discussions under "Special
Noncharacter Values" in Section 2.7 Special Character and
Noncharacter Values, and under "Noncharacters" in Section
13.6 Specials.
- These code points are permanently reserved as
noncharacters. In the future, it is possible that additional
code points may be specified to represent noncharacters.
C5     A process shall not interpret a noncharacter code point as an
abstract character. (The amended wording replaces "either U+FFFE or
U+FFFF" in the original rule.)
- The code points may be used internally, such as for
sentinel values or delimiters, but should not be exchanged
publicly.
C10    A process shall make no change in a valid coded character
representation other than the possible replacement of character
sequences by their canonical-equivalent sequences or the deletion of
noncharacter code points, if that process purports not to modify the
interpretation of that coded character sequence.
- If a noncharacter which does not have a specific internal
use is unexpectedly encountered in processing, an implementation
may signal an error or delete or ignore the noncharacter. If
these options are not taken, the noncharacter should be treated
as an unassigned code point.
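A simple membership test for the 34 noncharacters defined above might look like the following sketch; the function name is illustrative. Note that it reflects Unicode 3.0.1 only (later versions of the standard added further noncharacters in the range U+FDD0..U+FDEF).

```python
def is_noncharacter(cp: int) -> bool:
    # U+nFFFE and U+nFFFF for each plane n from 0 to 10 (hexadecimal),
    # i.e. any code point whose low 16 bits are FFFE or FFFF.
    return 0 <= cp <= 0x10FFFF and (cp & 0xFFFE) == 0xFFFE
```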
Section 5.3, Unknown and Missing Characters: Unassigned and
Private Use Character Codes, pages 108-109: add the following to
the end of the subsection:
In practice, applications must deal with unassigned code points
or unknown private use characters. This may occur, for example,
when the application is handling text that originated on a system
implementing a later release of Unicode, with additional assigned
characters. To work properly in implementations, unassigned code
points must be given default properties as if they were
characters, since various algorithms require properties to be
assigned to every character in order to function at all. These
properties are not uniform across all unassigned code points,
since certain ranges of code points need different properties to
maximize compatibility.
Normally, code points outside the repertoire of supported
characters would be displayed with a fall-back glyph, such as a
black box. However, format and control characters must not have
visible glyphs (although they may have an effect on other
characters in display). These characters are also ignored except
with respect to specific, defined processes: for example, ZERO
WIDTH NON-JOINER is ignored in collation. To allow a greater
degree of compatibility across versions of the standard, the
ranges U+2060..U+2069 and U+E0000..U+E1000 are reserved for future
format and control characters (General Category = Cf). Unassigned
code points in these ranges should be ignored in processing and
display.
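The handling described above might be sketched as follows, using Python's unicodedata module as a stand-in for an implementation's own property tables; the range list and the returned strings are illustrative only.

```python
import unicodedata

# Ranges reserved above for future format and control characters.
RESERVED_FORMAT_RANGES = [(0x2060, 0x2069), (0xE0000, 0xE1000)]

def display_action(cp: int) -> str:
    """Fall-back behaviour for a code point the implementation
    does not otherwise support (illustrative sketch)."""
    if unicodedata.category(chr(cp)) != "Cn":        # assigned code point
        return "render normally"
    if any(lo <= cp <= hi for lo, hi in RESERVED_FORMAT_RANGES):
        return "ignore in processing and display"
    return "show a fall-back glyph, such as a black box"
```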
The Unicode Bidirectional Algorithm assigns a
Bidirectional Category to
unassigned code points based on the expected direction of
characters to be added in the future. For more information, see
Bidirectional Character
Types in UAX
#9: The Bidirectional Algorithm.
UAX #14: Line Breaking Properties supplies the property "XX" for all
unassigned code points in its Definitions section.
In determining character widths for East Asian display,
UAX #11: East Asian
Width includes a section on Unassigned
and Private Use characters.
In normalization, unassigned code points are given the
Canonical Combining Class = 0, and no decomposition mapping.
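These defaults can be observed through Python's unicodedata module; the specific code point below is merely one that is unassigned in current Unicode data and may, of course, be assigned in a future version.

```python
import unicodedata

cp = "\u0378"  # unassigned at the time of writing
assert unicodedata.category(cp) == "Cn"      # unassigned
assert unicodedata.combining(cp) == 0        # default canonical combining class
assert unicodedata.decomposition(cp) == ""   # no decomposition mapping
```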
Section 5.16, Identifiers: Specific Character Additions,
page 134: the subsection name is changed to Specific Character
Adjustments, and the following note is added:
Note: a useful set of characters to consider for
exclusion from identifiers consists of all characters whose
compatibility mappings have a <font>
tag.
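Using Python's unicodedata module, that set can be approximated by checking whether a character's compatibility decomposition carries the <font> tag; the helper name below is ours.

```python
import unicodedata

def has_font_tag(ch: str) -> bool:
    # True when the compatibility mapping begins with the <font> tag.
    return unicodedata.decomposition(ch).startswith("<font>")

# For example, U+2102 DOUBLE-STRUCK CAPITAL C maps to <font> 0043.
assert has_font_tag("\u2102")
```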
Section 6.1, General Punctuation, Punctuation: U+0020-U+00BF,
page 149: the following note is added:
Note: any of the characters U+002E, U+002C, U+060C,
U+066B, or U+066C (and possibly others) can be used as numeric
separator characters, depending on the locale and user
customizations.
In some orthographies the same letters may either ligate or not,
depending on the intended reading. To account for this, the
semantics of the ZWNJ and ZWJ have been extended.
Section 13.2 Controlling Ligatures, page 318: the
text is superseded by the following:
To allow for finer control over ligature formation, in Unicode
3.0.1 the definitions of the following characters have been
broadened to cover ligatures as well as cursive connection:
U+200C ZERO WIDTH NON-JOINER
- The intended semantic is to break both cursive connections
and ligatures in rendering.
U+200D ZERO WIDTH JOINER
- The intended semantic is to produce a more connected
rendering of adjacent characters than would otherwise be the
case, if possible. In particular:
- If the two characters could form a ligature, but do not
normally, ZWJ requests that the ligature be used.
- Otherwise, if either of the characters could cursively
connect, but does not normally, ZWJ requests that each of the
characters take a cursive-connection form where possible.
- In particular, if a character X on one side has a
cursive form, and the other character Y does not, ZWJ
requests that X take a cursive form.
- Otherwise, where neither a ligature nor a cursive connection
is available, the ZWJ has no effect.
In other words, given the three broad categories below, ZWJ requests
that glyphs in the highest available category (for the given font)
be used; ZWNJ requests that glyphs in the lowest available
category (for the given font) be used:
- unconnected
- cursively connected
- ligated
For those unusual circumstances where someone wants to forbid
ligatures in a sequence XY, but promote cursive connection, the
sequence X<zwj><zwnj><zwj>Y can be used. The <zwnj> breaks
ligatures, while the two adjacent joiners cause the X and Y to
take adjacent cursive forms (where they exist). Similarly, if
someone wanted to have X take a cursive form but Y be isolated,
then the sequence X<zwj><zwnj>Y could be used (as in previous
versions of Unicode).
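The sequences discussed above are plain character sequences; in Python, for instance, they could be constructed as follows. The Arabic letters chosen are illustrative only, and whether the requested rendering actually appears depends on the font and rendering system.

```python
ZWNJ = "\u200C"  # ZERO WIDTH NON-JOINER
ZWJ  = "\u200D"  # ZERO WIDTH JOINER

# Request the "fi" ligature explicitly, where the font supports it.
request_ligature = "f" + ZWJ + "i"

x, y = "\u0628", "\u064A"  # ARABIC LETTER BEH, ARABIC LETTER YEH

# Forbid a ligature but promote cursive connection: X<zwj><zwnj><zwj>Y.
cursive_no_ligature = x + ZWJ + ZWNJ + ZWJ + y

# Let X take a cursive form while Y remains isolated: X<zwj><zwnj>Y.
x_cursive_y_isolated = x + ZWJ + ZWNJ + y
```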
Examples
The following provide samples of the desired renderings when the
joiner or non-joiner is inserted between two characters. In the
Arabic examples, the characters on the left side are in visual
order already, but have not yet been shaped.
[Table: Sample / Display / Actions — the sample renderings were provided as images.]
Implementation Notes
For modern font technologies, such as OpenType or AAT, font
vendors should add ZWJ to their ligature mapping tables as
appropriate. Thus where a font had a mapping from "f" + "i"
to the "fi" ligature glyph, the font designer should add the
additional mapping from "f" + ZWJ + "i"
to that same ligature glyph. On the other hand, ZWNJ will normally
have the desired effect naturally for most fonts without any
change, since it simply obstructs the normal ligature/cursive
connection behavior. As with all other alternate format
characters, fonts should use an invisible zero-width glyph for
representation of both ZWJ and ZWNJ.
Current Arabic shaping algorithms should need no change; optional
ligatures just would not be promoted by ZWJ, but current text
should not be affected. The reason is that the current use of ZWJ
between characters that normally cursively connect was redundant
in previous versions of Unicode and should occur in very few
instances. (As a matter of fact, with bad implementations of ZWJ
or with unsupported ZWJ, the cursive connection would actually be
broken.)