February 8, 2012
Following the discussion of L2/11-449, Extensions to Ideographic Description Sequences, at the meeting, here are proposed additions and deletions to Section 12.2 of TUS and Annex I of ISO/IEC 10646. The intent of the changes is to extend the definition of IDS to achieve the same result that L2/11-449 achieves via extended IDS.
Although the Unicode Standard includes more than 75,000 CJK unified ideographs, thousands of extremely rare CJK ideographs have nevertheless been left unencoded. Research into cataloging additional ideographs for encoding continues, but it is anticipated that at no point will the entire set of potential, encodable ideographs be completely exhausted. In particular, ideographs continue to be coined and such new coinages will invariably be unencoded.
The 12 characters in the Ideographic Description block provide a mechanism for the standard interchange of text that must reference unencoded ideographs. Unencoded ideographs can be described using these characters and encoded ideographs; the reader can then create a mental picture of the ideographs from the description.
This process is different from a formal encoding of an ideograph. There is no canonical description of unencoded ideographs; there is no semantic assigned to described ideographs; there is no equivalence defined for described ideographs. Conceptually, ideographic descriptions are more akin to the English phrase “an ‘e’ with an acute accent on it” than to the character sequence <U+0065, U+0301>.
In particular, support for the characters in the Ideographic Description block does not require the rendering engine to recreate the graphic appearance of the described character.
Note also that many of the ideographs that users might represent using the Ideographic Description characters will be formally encoded in future versions of the Unicode Standard.
The Ideographic Description Algorithm depends on the fact that virtually all CJK ideographs can be broken down into smaller pieces that are themselves ideographs. The broad coverage of the ideographs already encoded in the Unicode Standard implies that the vast majority of unencoded ideographs can be represented using the Ideographic Description characters.
Although Ideographic Description Sequences are intended primarily to represent unencoded ideographs and should not be used in data interchange to represent encoded ideographs, they also have pedagogical and analytic uses. A researcher, for example, may choose to represent the character U+86D9 蛙 as “⿰虫圭” in a database to provide a link between it and other characters sharing its phonetic, such as U+5A03 娃. The IRG is using Ideographic Description Sequences in this fashion to help provide a first-approximation, machine-generated set of unifications for its current work.
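The database use described above can be sketched as follows. This is a minimal illustration, not the IRG's tooling: the `decompositions` table and the `by_component` index are hypothetical names, and the description ⿰女圭 for U+5A03 娃 is supplied here only for illustration.

```python
from collections import defaultdict

# The twelve Ideographic Description characters, U+2FF0..U+2FFB.
IDCS = {chr(cp) for cp in range(0x2FF0, 0x2FFC)}

# Hypothetical analytic database: encoded ideograph -> one possible description.
decompositions = {
    "\u86D9": "\u2FF0\u866B\u572D",  # 蛙 = ⿰虫圭
    "\u5A03": "\u2FF0\u5973\u572D",  # 娃 = ⿰女圭 (illustrative assumption)
}

# Index every ideograph by the components that appear in its description.
by_component = defaultdict(set)
for char, ids in decompositions.items():
    for piece in ids:
        if piece not in IDCS:
            by_component[piece].add(char)

# Characters sharing the phonetic 圭 (U+572D).
shared = by_component["\u572D"]
```

Looking up U+572D 圭 in the index then yields both 蛙 and 娃, which is the kind of link the researcher in the example is after.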
Applicability to Other Scripts. The characters in the
Ideographic Description block are derived from a Chinese standard and
were encoded for use specifically in describing CJK ideographs. As a
result, the following detailed description of Ideographic Description
Sequences is specified entirely in terms of CJK unified ideographs and
CJK radicals. However, there are several large, historic East Asian
scripts whose writing systems were heavily influenced by the Han
script. Like the Han script, those siniform historic scripts, which
include Tangut, Jurchen, and Khitan, are logographic in
nature. Furthermore, they built up characters using radicals and
components, and with side-by-side and top-to-bottom stacking very
similar in structure to the way CJK ideographs are composed. These
historic scripts are not yet encoded in Version 6.0 of the Unicode
Standard, but it is quite likely that one or more of them will be
encoded in a future version.
The general usefulness of Ideographic Description Sequences for describing unencoded characters and the applicability of the characters in the Ideographic Description block to description of siniform logographs mean that the syntax for Ideographic Description Sequences can be generalized to extend to additional East Asian logographic scripts.
Ideographic Description Sequences. Ideographic Description
Sequences are defined by the following grammar. The list of characters
associated with the Unified_CJK_Ideograph and CJK_Radical properties
can be found in the Unicode Character Database. See Appendix A,
Notational Conventions, for the notational conventions used here.
IDS := Unified_CJK_Ideograph | CJK_Radical | IDS_BinaryOperator IDS IDS | IDS_TrinaryOperator IDS IDS IDS
IDS_BinaryOperator := U+2FF0 | U+2FF1 | U+2FF4 | U+2FF5 | U+2FF6 | U+2FF7 | U+2FF8 | U+2FF9 | U+2FFA | U+2FFB
IDS_TrinaryOperator := U+2FF2 | U+2FF3
In addition to the above grammar, Ideographic Description Sequences have
two length constraints: no sequence can be longer than 16 characters, and
no sequence can contain more than six ideographs or radicals in a row.
A sequence of characters that includes Ideographic Description
characters but does not conform to the grammar and length constraints
described here is not an Ideographic Description Sequence.
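The grammar and length constraints can be checked with a small recursive-descent parser. The sketch below is illustrative only. It assumes block ranges as a stand-in for the Unified_CJK_Ideograph and CJK_Radical properties (the authoritative lists are in the Unicode Character Database, and these ranges include some unassigned code points), and the names `is_operand` and `is_ids` are hypothetical.

```python
# Operators from the grammar: ten binary, two trinary.
BINARY = {chr(cp) for cp in (0x2FF0, 0x2FF1, *range(0x2FF4, 0x2FFC))}
TRINARY = {chr(0x2FF2), chr(0x2FF3)}

# Assumption: block ranges approximate the UCD properties.
UNIFIED_RANGES = [(0x3400, 0x4DBF), (0x4E00, 0x9FFF), (0x20000, 0x2A6DF),
                  (0x2A700, 0x2B73F), (0x2B740, 0x2B81F)]
# The twelve unified ideographs located in the compatibility block.
UNIFIED_IN_COMPAT = {0xFA0E, 0xFA0F, 0xFA11, 0xFA13, 0xFA14, 0xFA1F,
                     0xFA21, 0xFA23, 0xFA24, 0xFA27, 0xFA28, 0xFA29}
RADICAL_RANGES = [(0x2E80, 0x2EFF), (0x2F00, 0x2FDF)]


def is_operand(ch):
    """True for a (roughly) Unified_CJK_Ideograph or CJK_Radical character."""
    cp = ord(ch)
    if cp in UNIFIED_IN_COMPAT:
        return True
    return any(lo <= cp <= hi for lo, hi in UNIFIED_RANGES + RADICAL_RANGES)


def _parse(s, i):
    """Return the index just past one IDS starting at s[i], or -1 on failure."""
    if i >= len(s):
        return -1
    ch = s[i]
    if is_operand(ch):
        return i + 1
    arity = 2 if ch in BINARY else 3 if ch in TRINARY else 0
    if arity == 0:
        return -1
    i += 1
    for _ in range(arity):
        i = _parse(s, i)
        if i < 0:
            return -1
    return i


def is_ids(s, max_len=16, max_run=6):
    """Check the grammar plus both length constraints."""
    if len(s) > max_len:
        return False
    run = 0
    for ch in s:
        run = run + 1 if is_operand(ch) else 0
        if run > max_run:
            return False
    return _parse(s, 0) == len(s)
```

Under these assumptions, `is_ids("⿰虫圭")` holds, while `is_ids("⿰虫")` does not because the binary operator is missing an operand. Non-unique compatibility ideographs are rejected simply because they fall outside the operand ranges.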
The operators indicate the relative graphic positions of the operands running from left to right and from top to bottom.
Non-unique compatibility ideographs (U+F900..U+FA6B and
U+2F800..U+2FA1D, but not U+FA0E, U+FA0F, U+FA11, U+FA13, U+FA14,
U+FA1F, U+FA21, U+FA23, U+FA24, U+FA27, U+FA28, or U+FA29) are not
counted as unified ideographs for the purposes of this grammar,
although they do have the ideographic property (see Section 4.10,
Letters, Alphabetic, and Ideographic). These ideographs are excluded
from Ideographic Description Sequences to incrementally reduce the
ambiguity of such sequences. Non-unique compatibility ideographs have
canonical equivalences and are excluded on that basis. Some
CJK_Radical characters have compatibility equivalences to unified
ideographs, but compatibility equivalence is not considered a basis
for exclusion from Ideographic Description Sequences, because the
shape differences involved may be relevant to description of the forms
of unencoded ideographs.
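The exclusion rule above rests on canonical equivalence, which can be observed with Python's standard `unicodedata` module. The following is a sketch under one assumption: a compatibility ideograph is treated as non-unique exactly when it lies in one of the two compatibility blocks and has a decomposition mapping. The function name `is_nonunique_compat` is hypothetical.

```python
import unicodedata


def is_nonunique_compat(ch):
    """True if ch is a compatibility ideograph with a canonical decomposition."""
    cp = ord(ch)
    in_compat_block = 0xF900 <= cp <= 0xFAFF or 0x2F800 <= cp <= 0x2FA1F
    # A non-empty decomposition mapping signals canonical equivalence
    # to a unified ideograph.
    return in_compat_block and unicodedata.decomposition(ch) != ""


# U+F900 is canonically equivalent to U+8C48, so normalization replaces it.
normalized = unicodedata.normalize("NFC", "\uF900")
```

Because normalization silently replaces such characters with their unified counterparts, a description containing them would not survive a normalizing process unchanged, which is one way to see why they are left out of the grammar.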
A user wishing to represent an unencoded ideograph will need to analyze its structure to determine how to describe it using an Ideographic Description Sequence. As a rule, it is best to use the natural radical-phonetic division for an ideograph if it has one and to use as short a description sequence as possible; however, there is no requirement that these rules be followed. Beyond that, the shortest possible Ideographic Description Sequence is preferred.
The length constraints allow random access into a string of ideographs to have well-defined limits. Only a small number of characters need to be scanned backward to determine whether those characters are part of an Ideographic Description Sequence.
The fact that Ideographic Description Sequences can contain other Ideographic Description Sequences means that implementations may need to be aware of the recursion depth of a sequence and its back-scan length. The recursion depth of an Ideographic Description Sequence is the maximum number of pending operations encountered in the process of parsing an Ideographic Description Sequence. In Figure 12-8, the maximum recursion depth is shown in the eleventh example, where four operations are still pending at the end of the Ideographic Description Sequence.
The back-scan length is the maximum number of ideographs unbroken by Ideographic Description characters in the sequence. None of the examples in Figure 12-8 has more than six ideographs in a row; for many, the back-scan length is two.
The Unicode Standard places no formal limits on the recursion depth of Ideographic Description Sequences. It does, however, limit the back-scan length for valid Ideographic Description Sequences to six.
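Both measures can be computed in a single left-to-right pass. The sketch below assumes its input is already a well-formed Ideographic Description Sequence and treats every character that is not an Ideographic Description character as an operand. The function names are hypothetical.

```python
# Number of operands each Ideographic Description character takes.
ARITY = {chr(cp): 2 for cp in (0x2FF0, 0x2FF1, *range(0x2FF4, 0x2FFC))}
ARITY.update({chr(0x2FF2): 3, chr(0x2FF3): 3})


def recursion_depth(ids):
    """Maximum number of operators pending at any point while parsing."""
    pending = []  # operands still owed to each open operator
    deepest = 0
    for ch in ids:
        if ch in ARITY:
            pending.append(ARITY[ch])
            deepest = max(deepest, len(pending))
        else:
            # An operand completes one slot; a finished operator in turn
            # completes one slot of its parent.
            while pending:
                pending[-1] -= 1
                if pending[-1] > 0:
                    break
                pending.pop()
    return deepest


def back_scan_length(ids):
    """Longest run of characters unbroken by an Ideographic Description character."""
    longest = run = 0
    for ch in ids:
        run = 0 if ch in ARITY else run + 1
        longest = max(longest, run)
    return longest
```

For ⿰虫圭 this gives a recursion depth of 1 and a back-scan length of 2. A right-nested sequence such as ⿰虫⿰圭⿰女⿰井蛙 reaches a depth of 4, with four operators still pending when the final operand is read.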
Examples 9–13 illustrate more complex Ideographic Description Sequences showing the use of some of the less common operators.
Equivalence. Many unencoded ideographs can be described in more than one way using this algorithm, either because the pieces of a description can themselves be broken down further (examples 1–3 in Figure 12-8) or because duplications appear within the Unicode Standard (examples 5 and 6 in Figure 12-8).
The Unicode Standard does not define equivalence for two Ideographic Description Sequences that are not identical. Figure 12-8 contains numerous examples illustrating how different Ideographic Description Sequences might be used to describe the same ideograph.
In particular, Ideographic Description Sequences should not be used to provide alternative graphic representations of encoded ideographs in data interchange. Searching, collation, and other content-based text operations would then fail.
Interaction with the Ideographic Variation Mark. As with ideographs proper, the Ideographic Variation Mark (U+303E) may be placed before an Ideographic Description Sequence to indicate that the description is merely an approximation of the original ideograph desired. A sequence of characters that includes an Ideographic Variation Mark is not an Ideographic Description Sequence.
Rendering. Ideographic Description characters are visible characters and are not to be treated as control characters. Thus the sequence U+2FF1 U+4E95 U+86D9 must have a distinct appearance from U+4E95 U+86D9.
An implementation may render a valid Ideographic Description Sequence either by rendering the individual characters separately or by parsing the Ideographic Description Sequence and drawing the ideograph so described. In the latter case, the Ideographic Description Sequence should be treated as a ligature of the individual characters for purposes of hit testing, cursor movement, and other user interface operations. (See Section 5.11, Editing and Selection.)
Character Boundaries. Ideographic Description characters are not combining characters, and there is no requirement that they affect character or word boundaries. Thus U+2FF1 U+4E95 U+86D9 may be treated as a sequence of three characters or even three words.
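The point that Ideographic Description characters are ordinary visible symbols rather than combining marks can be checked against the character properties exposed by Python's standard `unicodedata` module. This is an illustration of the properties only, not of any particular boundary algorithm.

```python
import unicodedata

seq = "\u2FF1\u4E95\u86D9"  # U+2FF1 U+4E95 U+86D9

# General category and canonical combining class for each code point.
properties = [(f"U+{ord(ch):04X}", unicodedata.category(ch),
               unicodedata.combining(ch)) for ch in seq]

# Normalization leaves the sequence as three separate code points.
unchanged = unicodedata.normalize("NFC", seq) == seq
```

U+2FF1 reports general category So (Symbol, other) with combining class 0, so nothing in its properties attaches it to the ideographs that follow.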
Implementations of the Unicode Standard may choose to parse Ideographic Description Sequences when calculating word and character boundaries. Note that such a decision will make the algorithms involved significantly more complicated and slower.
Standards. The Ideographic Description characters are found in GBK—an extension to GB 2312-80 that adds all Unicode ideographs not already in GB 2312-80. GBK is defined as a normative annex of GB 13000.1-93.
An Ideographic Description Character (IDC) is a graphic character, which is used with a sequence of other graphic characters to form an Ideographic Description Sequence (IDS). Such a sequence may be used to describe an ideographic character which is not specified within this International Standard.
The IDS describes the ideograph in the abstract form. It is not interpreted as a composed character and does not imply any specific form of rendering.
NOTE – An IDS is not a character and therefore is not a member of the repertoire of this International Standard.
An IDS consists of an IDC followed by a fixed number of Description Components (DC). A DC may be any one of the following:
NOTE 1 – The above description implies that any IDS may be nested within another IDS.
Each IDC has four properties, as summarized in Table I.1 below.
The syntax of the IDS introduced by each IDC is indicated in the “IDS Acronym and Syntax” column of the table by the abbreviated name of the IDC (e.g. IDC-LTR) followed by the corresponding number of DCs, i.e. (D1 D2) or (D1 D2 D3).
NOTE 2 – An IDS is restricted to no more than 16 characters in length. Also no more than six ideographs and/or radicals may occur between any two instances of an IDC character within an IDS.
IDEOGRAPHIC DESCRIPTION CHARACTER LEFT TO RIGHT (2FF0): The IDS introduced by this character describes the abstract form of the ideograph with D1 on the left and D2 on the right.
IDEOGRAPHIC DESCRIPTION CHARACTER ABOVE TO BELOW (2FF1): The IDS introduced by this character describes the abstract form of the ideograph with D1 above D2.
IDEOGRAPHIC DESCRIPTION CHARACTER LEFT TO MIDDLE AND RIGHT (2FF2): The IDS introduced by this character describes the abstract form of the ideograph with D1 on the left of D2, and D2 on the left of D3.
IDEOGRAPHIC DESCRIPTION CHARACTER ABOVE TO MIDDLE AND BELOW (2FF3): The IDS introduced by this character describes the abstract form of the ideograph with D1 above D2, and D2 above D3.
IDEOGRAPHIC DESCRIPTION CHARACTER FULL SURROUND (2FF4): The IDS introduced by this character describes the abstract form of the ideograph with D1 surrounding D2.
IDEOGRAPHIC DESCRIPTION CHARACTER SURROUND FROM ABOVE (2FF5): The IDS introduced by this character describes the abstract form of the ideograph with D1 above D2, and surrounding D2 on both sides.
IDEOGRAPHIC DESCRIPTION CHARACTER SURROUND FROM BELOW (2FF6): The IDS introduced by this character describes the abstract form of the ideograph with D1 below D2, and surrounding D2 on both sides.
IDEOGRAPHIC DESCRIPTION CHARACTER SURROUND FROM LEFT (2FF7): The IDS introduced by this character describes the abstract form of the ideograph with D1 on the left of D2, and surrounding D2 above and below.
IDEOGRAPHIC DESCRIPTION CHARACTER SURROUND FROM UPPER LEFT (2FF8): The IDS introduced by this character describes the abstract form of the ideograph with D1 at the top left corner of D2, and partly surrounding D2 above and to the left.
IDEOGRAPHIC DESCRIPTION CHARACTER SURROUND FROM UPPER RIGHT (2FF9): The IDS introduced by this character describes the abstract form of the ideograph with D1 at the top right corner of D2, and partly surrounding D2 above and to the right.
IDEOGRAPHIC DESCRIPTION CHARACTER SURROUND FROM LOWER LEFT (2FFA): The IDS introduced by this character describes the abstract form of the ideograph with D1 at the bottom left corner of D2, and partly surrounding D2 below and to the left.
IDEOGRAPHIC DESCRIPTION CHARACTER OVERLAID (2FFB): The IDS introduced by this character describes the abstract form of the ideograph with D1 and D2 overlaying each other.
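The twelve descriptions above map naturally onto a lookup table from each IDC to its number of DCs and a phrase built from D1, D2 and D3. The sketch below turns an IDS into an English description along those lines. The names `LAYOUT` and `describe` are hypothetical, the phrases are abbreviated from the text above, and nested sequences are bracketed for readability.

```python
# IDC -> (number of DCs, phrase template paraphrasing the descriptions above).
LAYOUT = {
    "\u2FF0": (2, "{0} on the left and {1} on the right"),
    "\u2FF1": (2, "{0} above {1}"),
    "\u2FF2": (3, "{0} on the left of {1}, and {1} on the left of {2}"),
    "\u2FF3": (3, "{0} above {1}, and {1} above {2}"),
    "\u2FF4": (2, "{0} surrounding {1}"),
    "\u2FF5": (2, "{0} above {1}, surrounding it on both sides"),
    "\u2FF6": (2, "{0} below {1}, surrounding it on both sides"),
    "\u2FF7": (2, "{0} on the left of {1}, surrounding it above and below"),
    "\u2FF8": (2, "{0} at the top left corner of {1}"),
    "\u2FF9": (2, "{0} at the top right corner of {1}"),
    "\u2FFA": (2, "{0} at the bottom left corner of {1}"),
    "\u2FFB": (2, "{0} and {1} overlaying each other"),
}


def _describe(s, i):
    """Describe one IDS starting at s[i]; return (text, next index, is_nested)."""
    if i >= len(s):
        raise ValueError("IDS ends before all description components are given")
    ch = s[i]
    if ch not in LAYOUT:
        return ch, i + 1, False  # a plain ideograph or radical
    count, template = LAYOUT[ch]
    i += 1
    parts = []
    for _ in range(count):
        text, i, nested = _describe(s, i)
        parts.append(f"[{text}]" if nested else text)
    return template.format(*parts), i, True


def describe(ids):
    """Return an English rendering of the abstract form an IDS describes."""
    text, end, _ = _describe(ids, 0)
    if end != len(ids):
        raise ValueError("trailing characters after a complete IDS")
    return text
```

For example, `describe("⿰虫圭")` yields "虫 on the left and 圭 on the right". This only restates the abstract form in words and, like the IDS itself, implies no specific rendering.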