|Version||1.0 - L2/04-339|
|Authors||Asmus Freytag (firstname.lastname@example.org), Mark Davis (email@example.com)|
This report defines named sequences of Unicode characters.
[ Many known issues. - Need overall direction, not so much detailed edits, from reviewers.]
This document has been reviewed by Unicode members and other interested parties, and has been approved by the Unicode Technical Committee as a Unicode Standard Annex / Unicode Technical Standard / Unicode Technical Report. This is a stable document and may be used as reference material or cited as a normative reference by other specifications.
|This document is a proposed draft / draft / proposed update / proposed update of a previously approved Unicode Standard Annex / Unicode Technical Standard / Unicode Technical Report. Publication does not imply endorsement by the Unicode Consortium. This is a draft document which may be updated, replaced, or superseded by other documents at any time. This is not a stable document; it is inappropriate to cite this document as other than a work in progress.|
A Unicode Standard Annex (UAX) forms an integral part of the Unicode Standard, but is published as a separate document. The Unicode Standard may require conformance to normative content in a Unicode Standard Annex, if so specified in the Conformance chapter of that version of the Unicode Standard. The version number of a UAX document corresponds to the version number of the Unicode Standard at the last point that the UAX document was updated.
A Unicode Technical Standard (UTS) is an independent specification. Conformance to the Unicode Standard does not imply conformance to any UTS. Each UTS specifies a base version of the Unicode Standard. Conformance to the UTS requires conformance to that version or higher. A Unicode Technical Report (UTR) contains informative material. Conformance to the Unicode Standard does not imply conformance to any UTR. Other specifications, however, are free to make normative references to a UTR.
Please submit corrigenda and other comments with the online reporting form [Feedback]. Related information that is useful in understanding this document is found in the References. For the latest version of the Unicode Standard see [Unicode]. For a list of current Unicode Technical Reports see [Reports]. For more information about versions of the Unicode Standard, see [Versions].
This annex/technical report specifies sequences of characters that may be treated as single units, either in particular types of processing, in reference by standards, in listings of repertoires (such as for fonts or keyboards), or in communicating with users.
Some standards, notably those developed by ISO/IEC JTC1/SC2, have a long-standing tradition of using the formal name of a character as the means to identify corresponding characters across standards. With Unicode as the universal character set, this practice has largely given way to using the Unicode code point as the unique identifier. However, some standards contain entities or characters that are mapped not to a single Unicode code point, but to a sequence of characters. In these instances it is convenient to have a name for the sequence.
Here are examples of such characters, and their representation as a sequence of code points.
[TBD formatting of tables, subsetting]
|Character||Code Points||Linguistic Usage|
|0063 0068||Slovak, traditional Spanish|
|0074 02B0||Native American languages|
|0069 0307 0301|
|30C8 309A||Ainu in kana transcription|
|17BB 17C6||khmer vowel sign srak om|
|17B6 17C6||khmer vowel sign srak am|
|17D2 1780||khmer consonant sign coeng ka|
|17D2 1781||khmer consonant sign coeng kha|
|17D2 1782||khmer consonant sign coeng ko|
|17D2 1783||khmer consonant sign coeng kho|
|17D2 1784||khmer consonant sign coeng ngo|
|17D2 1785||khmer consonant sign coeng ca|
|17D2 1786||khmer consonant sign coeng cha|
|17D2 1787||khmer consonant sign coeng co|
|17D2 1788||khmer consonant sign coeng cho|
|17D2 1789||khmer consonant sign coeng nyo|
|17D2 178A||khmer consonant sign coeng da|
|17D2 178B||khmer consonant sign coeng ttha|
|17D2 178C||khmer consonant sign coeng do|
|17D2 178D||khmer consonant sign coeng ttho|
|17D2 178E||khmer consonant sign coeng na|
|17D2 178F||khmer consonant sign coeng ta|
|17D2 1790||khmer consonant sign coeng tha|
|17D2 1791||khmer consonant sign coeng to|
|17D2 1792||khmer consonant sign coeng tho|
|17D2 1793||khmer consonant sign coeng no|
|17D2 1794||khmer consonant sign coeng ba|
|17D2 1795||khmer consonant sign coeng pha|
|17D2 1796||khmer consonant sign coeng po|
|17D2 1797||khmer consonant sign coeng pho|
|17D2 1798||khmer consonant sign coeng mo|
|17D2 1799||khmer consonant sign coeng yo|
|17D2 179A||khmer consonant sign coeng ro|
|17D2 179B||khmer consonant sign coeng lo|
|17D2 179C||khmer consonant sign coeng vo|
|17D2 179D||khmer consonant sign coeng sha|
|17D2 179E||khmer consonant sign coeng ssa|
|17D2 179F||khmer consonant sign coeng sa|
|17D2 17A0||khmer consonant sign coeng ha|
|17D2 17A2||khmer consonant sign coeng qa|
|17D2 17A7||khmer vowel sign coeng qu|
|17D2 17AB||khmer vowel sign coeng ry|
|17D2 17AF||khmer vowel sign coeng qe|
While all combinations of accents and base characters are encodable in Unicode, not all combinations are required for a particular purpose, and only some may be supported. In these contexts, named sequences provide a shorthand for referring to a specific sequence. However, while there are many sequences of characters that get special treatment that varies by language, such as sequences of characters that are collated as single units, not all such sequences necessarily need to be named.
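As an illustration of treating listed sequences as single units, the following Python sketch splits a string into units, matching the longest listed sequence first. It is not part of this specification. The function name `split_units` and the table `NAMED_SEQUENCES` are hypothetical; the labels and code points are copied from a few rows of the table above.

```python
# A few rows copied from the table above. The dictionary itself is an
# illustrative assumption, not a data format defined by this report.
NAMED_SEQUENCES = {
    "khmer vowel sign srak om": "\u17BB\u17C6",
    "khmer vowel sign srak am": "\u17B6\u17C6",
    "khmer consonant sign coeng ka": "\u17D2\u1780",
}


def split_units(text, sequences=NAMED_SEQUENCES):
    """Split text into (unit, label) pairs.

    A unit is either a listed sequence (label is its name) or a single
    code point (label is None). Longer sequences are tried first.
    """
    by_value = {value: label for label, value in sequences.items()}
    longest = max(map(len, by_value), default=1)
    units = []
    i = 0
    while i < len(text):
        remaining = len(text) - i
        # Try candidate lengths from the longest listed sequence down to 2.
        for size in range(min(longest, remaining), 1, -1):
            chunk = text[i:i + size]
            if chunk in by_value:
                units.append((chunk, by_value[chunk]))
                i += size
                break
        else:
            # No listed sequence starts here; emit one code point.
            units.append((text[i], None))
            i += 1
    return units
```

For example, `split_units("\u1780\u17D2\u1780")` yields the single letter U+1780 followed by the two-code-point unit labeled "khmer consonant sign coeng ka".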
The standard notation for a sequence of characters defined by the Unicode Standard is
<HHHH, HHHH, ... HHHH>
where each HHHH is a sequence of four to six uppercase hexadecimal digits, optionally preceded by "U+".
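The notation can be read and written mechanically. The following Python sketch is one possible reading of the description above, not a normative parser. The function names `parse_sequence` and `format_sequence` are hypothetical, and two details are assumptions: whitespace around each item is ignored, and values above 10FFFF are rejected.

```python
import re

# Four to six uppercase hexadecimal digits, optionally preceded by "U+".
_CODE_POINT = re.compile(r"(?:U\+)?([0-9A-F]{4,6})")


def parse_sequence(text):
    """Parse '<HHHH, HHHH, ...>' into a list of integer code points."""
    text = text.strip()
    if not (text.startswith("<") and text.endswith(">")):
        raise ValueError("sequence must be enclosed in < and >")
    code_points = []
    for item in text[1:-1].split(","):
        match = _CODE_POINT.fullmatch(item.strip())
        if match is None:
            raise ValueError(f"not a code point: {item!r}")
        value = int(match.group(1), 16)
        if value > 0x10FFFF:
            # Assumption: reject values outside the Unicode code space.
            raise ValueError(f"out of range: {item!r}")
        code_points.append(value)
    return code_points


def format_sequence(code_points):
    """Format code points in the notation, without the optional 'U+'."""
    return "<" + ", ".join(f"{cp:04X}" for cp in code_points) + ">"
```

For example, `parse_sequence("<17D2, 1780>")` returns `[0x17D2, 0x1780]`, and passing that list to `format_sequence` gives back `"<17D2, 1780>"`.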
[TBD: edit boilerplate]
|Conformance to the Unicode Standard does not require conformance to the specification in this document.|
|Conformance to the Unicode Standard does not require / requires conformance to the specification in this document. The relationship between conformance to the Unicode Standard, and conformance to an individual Unicode Standard Annex (UAX) is described in more detail in the Unicode Standard in Section 3.2 Conformance.|
|A Unicode-conformant implementation that implements this specification must do so as described in the following clause.|
|There are many different ways to [break lines of text], and the Unicode Standard does not restrict the ways in which implementations can do this. However, any Unicode-conformant implementation that purports to implement this specification must do so as described in the following clause. Implementations are free to deviate from this, as long as they do not purport to conform to this specification.|
|C1||An implementation that claims conformance
to the default [Unicode Line Break
Algorithm] shall produce the same results as the
algorithm published in this specification.
|C2||This specification defines default
behavior, which is to be used in the absence of tailoring
for particular languages and environments.
|C3||If tailoring is used by an implementation that
claims conformance to the default [Unicode Line Break Algorithm],
the existence of such tailoring must be documented.
At times, this specification recommends best practice. These recommendations are not normative and conformance with this specification does not depend on their realization. These recommendations contain the expression "We recommend ...", "This specification recommends ...", or some similar wording.
Names of Unicode sequences are unique. Where possible, they are constructed by appending the names of the constituent elements together while eliding duplicate elements. If a sequence would have a name that already exists, then the name is modified suitably to avoid the clash. Names of sequences are enclosed in <>.
<A, B, C> <LATIN LETTER CAPITAL A B C>
<AE, F> <LATIN LETTER CAPITAL A E F>
<A, E, F> <LATIN LETTER CAPITAL A WITH E AND F>
<A COMBINING DIACRITIC ABOVE> <A WITH DIACRITIC ABOVE>
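The naming rule above is stated informally. The following Python sketch shows one way it might be mechanized, under stated assumptions: "eliding duplicate elements" is interpreted as dropping the leading words that a constituent name shares with the first name, and a clash is only detected, since this report does not say how a clashing name is to be modified. The function names `sequence_name` and `check_unique` are hypothetical.

```python
def sequence_name(names):
    """Build a sequence name from the constituent character names.

    Assumption: a duplicate element is a leading word shared with the
    first name. The last word of each name is always kept.
    """
    if not names:
        raise ValueError("need at least one character name")
    first = names[0].split()
    result = list(first)
    for name in names[1:]:
        words = name.split()
        shared = 0
        while (shared < len(words) - 1 and shared < len(first)
               and words[shared] == first[shared]):
            shared += 1
        result.extend(words[shared:])
    # Names of sequences are enclosed in <>.
    return "<" + " ".join(result) + ">"


def check_unique(name, existing_names):
    """Raise if the name clashes with one that already exists."""
    if name in existing_names:
        raise ValueError(f"name already in use: {name}")
    return name
```

For example, `sequence_name(["LATIN SMALL LETTER C", "LATIN SMALL LETTER H"])` gives `<LATIN SMALL LETTER C H>`. A name produced this way would still need review, because the rule calls for a suitable modification whenever a clash occurs.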
A data file is available. See [Composite].
|[Charts]||The online code charts can be found at http://www.unicode.org/charts/
An index to characters names with links to the corresponding chart is found at http://www.unicode.org/charts/charindex.html
|[Composite]||[Tentative] Named composite entities data file
For the latest version, see:
For other versions, see:
|[Feedback]||Reporting Errors and Requesting Information Online http://www.unicode.org/reporting.html|
|[FAQ]||Unicode Frequently Asked Questions
For answers to common questions on technical issues.
For explanations of terminology used in this and other documents.
|[Normal]||Unicode Technical Report #15: Unicode Normalization Forms http://www.unicode.org/unicode/reports/tr15/|
Technical Standard #18: Regular Expressions
|[Reports]||Unicode Technical Reports
For information on the status and development process for technical reports, and for a list of technical reports.
|[Scripts]||Scripts data file
For the latest version, see:
For other versions, see: http://www.unicode.org/standard/versions/
|[UCD]||Unicode Character Database. http://www.unicode.org/ucd For an overview of the Unicode Character Database and a list of its associated files|
|[Unicode]||The Unicode Standard
For the latest version see: http://www.unicode.org/versions/latest/.
For the current version see: http://www.unicode.org/versions/Unicode4.1.0/.
For the last major version see:
The Unicode Consortium. The Unicode Standard, Version 4.0. (Boston, MA, Addison-Wesley, 2003. 0-321-18578-1).
|[Versions]||Versions of the Unicode Standard
For information on version numbering, and citing and referencing the Unicode Standard, the Unicode Character Database, and Unicode Technical Reports.
The following summarizes modifications from the previous version of this document.
Unicode and the Unicode logo are trademarks of Unicode, Inc., and are registered in some jurisdictions.